# Hyperparameter Search using Trainer API ๐Ÿค— Transformers provides a `Trainer` class optimized for training ๐Ÿค— Transformers models, making it easier to start training without manually writing your own training loop. The `Trainer` provides API for hyperparameter search. This doc shows how to enable it in example. ## Hyperparameter Search backend `Trainer` supports four hyperparameter search backends currently: [optuna](https://optuna.org/), [sigopt](https://sigopt.com/), [raytune](https://docs.ray.io/en/latest/tune/index.html) and [wandb](https://wandb.ai/site/sweeps). you should install them before using them as the hyperparameter search backend ```bash pip install optuna/sigopt/wandb/ray[tune] ``` ## How to enable Hyperparameter search in example Define the hyperparameter search space, different backends need different format. For sigopt, see sigopt [object_parameter](https://docs.sigopt.com/ai-module-api-references/api_reference/objects/object_parameter), it's like following: ```py >>> def sigopt_hp_space(trial): ... return [ ... {"bounds": {"min": 1e-6, "max": 1e-4}, "name": "learning_rate", "type": "double"}, ... { ... "categorical_values": ["16", "32", "64", "128"], ... "name": "per_device_train_batch_size", ... "type": "categorical", ... }, ... ] ``` For optuna, see optuna [object_parameter](https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/002_configurations.html#sphx-glr-tutorial-10-key-features-002-configurations-py), it's like following: ```py >>> def optuna_hp_space(trial): ... return { ... "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True), ... "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [16, 32, 64, 128]), ... } ``` Optuna provides multi-objective HPO. You can pass `direction` in `hyperparameter_search` and define your own compute_objective to return multiple objective values. The Pareto Front (`List[BestRun]`) will be returned in hyperparameter_search, you should refer to the test case `TrainerHyperParameterMultiObjectOptunaIntegrationTest` in [test_trainer](https://github.com/huggingface/transformers/blob/main/tests/trainer/test_trainer.py). It's like following ```py >>> best_trials = trainer.hyperparameter_search( ... direction=["minimize", "maximize"], ... backend="optuna", ... hp_space=optuna_hp_space, ... n_trials=20, ... compute_objective=compute_objective, ... ) ``` For raytune, see raytune [object_parameter](https://docs.ray.io/en/latest/tune/api/search_space.html), it's like following: ```py >>> def ray_hp_space(trial): ... return { ... "learning_rate": tune.loguniform(1e-6, 1e-4), ... "per_device_train_batch_size": tune.choice([16, 32, 64, 128]), ... } ``` For wandb, see wandb [object_parameter](https://docs.wandb.ai/guides/sweeps/configuration), it's like following: ```py >>> def wandb_hp_space(trial): ... return { ... "method": "random", ... "metric": {"name": "objective", "goal": "minimize"}, ... "parameters": { ... "learning_rate": {"distribution": "uniform", "min": 1e-6, "max": 1e-4}, ... "per_device_train_batch_size": {"values": [16, 32, 64, 128]}, ... }, ... } ``` Define a `model_init` function and pass it to the `Trainer`, as an example: ```py >>> def model_init(trial): ... return AutoModelForSequenceClassification.from_pretrained( ... model_args.model_name_or_path, ... from_tf=bool(".ckpt" in model_args.model_name_or_path), ... config=config, ... cache_dir=model_args.cache_dir, ... revision=model_args.model_revision, ... token=True if model_args.use_auth_token else None, ... ) ``` Create a `Trainer` with your `model_init` function, training arguments, training and test datasets, and evaluation function: ```py >>> trainer = Trainer( ... model=None, ... args=training_args, ... train_dataset=small_train_dataset, ... eval_dataset=small_eval_dataset, ... compute_metrics=compute_metrics, ... processing_class=tokenizer, ... model_init=model_init, ... data_collator=data_collator, ... ) ``` Call hyperparameter search, get the best trial parameters, backend could be `"optuna"`/`"sigopt"`/`"wandb"`/`"ray"`. direction can be`"minimize"` or `"maximize"`, which indicates whether to optimize greater or lower objective. You could define your own compute_objective function, if not defined, the default compute_objective will be called, and the sum of eval metric like f1 is returned as objective value. ```py >>> best_trial = trainer.hyperparameter_search( ... direction="maximize", ... backend="optuna", ... hp_space=optuna_hp_space, ... n_trials=20, ... compute_objective=compute_objective, ... ) ``` ## Hyperparameter search For DDP finetune Currently, Hyperparameter search for DDP is enabled for optuna and sigopt. Only the rank-zero process will generate the search trial and pass the argument to other ranks. # Fully Sharded Data Parallel [Fully Sharded Data Parallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) is a data parallel method that shards a model's parameters, gradients and optimizer states across the number of available GPUs (also called workers or *rank*). Unlike [DistributedDataParallel (DDP)](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html), FSDP reduces memory-usage because a model is replicated on each GPU. This improves GPU memory-efficiency and allows you to train much larger models on fewer GPUs. FSDP is integrated with the Accelerate, a library for easily managing training in distributed environments, which means it is available for use from the `Trainer` class. Before you start, make sure Accelerate is installed and at least PyTorch 2.1.0 or newer. ```bash pip install accelerate ``` ## FSDP configuration To start, run the [`accelerate config`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-config) command to create a configuration file for your training environment. Accelerate uses this configuration file to automatically setup the correct training environment based on your selected training options in `accelerate config`. ```bash accelerate config ``` When you run `accelerate config`, you'll be prompted with a series of options to configure your training environment. This section covers some of the most important FSDP options. To learn more about the other available FSDP options, take a look at the [fsdp_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) parameters. ### Sharding strategy FSDP offers a number of sharding strategies to select from: * `FULL_SHARD` - shards model parameters, gradients and optimizer states across workers; select `1` for this option * `SHARD_GRAD_OP`- shard gradients and optimizer states across workers; select `2` for this option * `NO_SHARD` - don't shard anything (this is equivalent to DDP); select `3` for this option * `HYBRID_SHARD` - shard model parameters, gradients and optimizer states within each worker where each worker also has a full copy; select `4` for this option * `HYBRID_SHARD_ZERO2` - shard gradients and optimizer states within each worker where each worker also has a full copy; select `5` for this option This is enabled by the `fsdp_sharding_strategy` flag. ### CPU offload You could also offload parameters and gradients when they are not in use to the CPU to save even more GPU memory and help you fit large models where even FSDP may not be sufficient. This is enabled by setting `fsdp_offload_params: true` when running `accelerate config`. ### Wrapping policy FSDP is applied by wrapping each layer in the network. The wrapping is usually applied in a nested way where the full weights are discarded after each forward pass to save memory for use in the next layer. The *auto wrapping* policy is the simplest way to implement this and you don't need to change any code. You should select `fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP` to wrap a Transformer layer and `fsdp_transformer_layer_cls_to_wrap` to specify which layer to wrap (for example `BertLayer`). Otherwise, you can choose a size-based wrapping policy where FSDP is applied to a layer if it exceeds a certain number of parameters. This is enabled by setting `fsdp_wrap_policy: SIZE_BASED_WRAP` and `min_num_param` to the desired size threshold. ### Checkpointing Intermediate checkpoints should be saved with `fsdp_state_dict_type: SHARDED_STATE_DICT` because saving the full state dict with CPU offloading on rank 0 takes a lot of time and often results in `NCCL Timeout` errors due to indefinite hanging during broadcasting. You can resume training with the sharded state dicts with the [load_state](https://huggingface.co/docs/accelerate/main/en/package_reference/accelerator#accelerate.Accelerator.load_state)` method. ```py # directory containing checkpoints accelerator.load_state("ckpt") ``` However, when training ends, you want to save the full state dict because sharded state dict is only compatible with FSDP. ```py if trainer.is_fsdp_enabled: trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT") trainer.save_model(script_args.output_dir) ``` ### TPU [PyTorch XLA](https://pytorch.org/xla/release/2.1/index.html) supports FSDP training for TPUs and it can be enabled by modifying the FSDP configuration file generated by `accelerate config`. In addition to the sharding strategies and wrapping options specified above, you can add the parameters shown below to the file. ```yaml xla: True # must be set to True to enable PyTorch/XLA xla_fsdp_settings: # XLA-specific FSDP parameters xla_fsdp_grad_ckpt: True # use gradient checkpointing ``` The [`xla_fsdp_settings`](https://github.com/pytorch/xla/blob/2e6e183e0724818f137c8135b34ef273dea33318/torch_xla/distributed/fsdp/xla_fully_sharded_data_parallel.py#L128) allow you to configure additional XLA-specific parameters for FSDP. ## Launch training An example FSDP configuration file may look like: ```yaml compute_environment: LOCAL_MACHINE debug: false distributed_type: FSDP downcast_bf16: 'no' fsdp_config: fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP fsdp_backward_prefetch_policy: BACKWARD_PRE fsdp_cpu_ram_efficient_loading: true fsdp_forward_prefetch: false fsdp_offload_params: true fsdp_sharding_strategy: 1 fsdp_state_dict_type: SHARDED_STATE_DICT fsdp_sync_module_states: true fsdp_transformer_layer_cls_to_wrap: BertLayer fsdp_use_orig_params: true machine_rank: 0 main_training_function: main mixed_precision: bf16 num_machines: 1 num_processes: 2 rdzv_backend: static same_network: true tpu_env: [] tpu_use_cluster: false tpu_use_sudo: false use_cpu: false ``` To launch training, run the [`accelerate launch`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-launch) command and it'll automatically use the configuration file you previously created with `accelerate config`. ```bash accelerate launch my-trainer-script.py ``` ```bash accelerate launch --fsdp="full shard" --fsdp_config="path/to/fsdp_config/ my-trainer-script.py ``` ## Next steps FSDP can be a powerful tool for training really large models and you have access to more than one GPU or TPU. By sharding the model parameters, optimizer and gradient states, and even offloading them to the CPU when they're inactive, FSDP can reduce the high cost of large-scale training. If you're interested in learning more, the following may be helpful: * Follow along with the more in-depth Accelerate guide for [FSDP](https://huggingface.co/docs/accelerate/usage_guides/fsdp). * Read the [Introducing PyTorch Fully Sharded Data Parallel (FSDP) API](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) blog post. * Read the [Scaling PyTorch models on Cloud TPUs with FSDP](https://pytorch.org/blog/scaling-pytorch-models-on-cloud-tpus-with-fsdp/) blog post. # Perplexity of fixed-length models Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note that the metric applies specifically to classical language models (sometimes called autoregressive or causal language models) and is not well defined for masked language models like BERT (see [summary of the models](model_summary)). Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized sequence \\(X = (x_0, x_1, \dots, x_t)\\), then the perplexity of \\(X\\) is, $$\text{PPL}(X) = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{ When working with approximate models, however, we typically have a constraint on the number of tokens the model can process. The largest version of [GPT-2](model_doc/gpt2), for example, has a fixed length of 1024 tokens, so we cannot calculate \\(p_\theta(x_t|x_{ This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will have less context at most of the prediction steps. Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly sliding the context window so that the model has more context when making each prediction. Sliding window PPL taking advantage of all available context This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by 1 token a time. This allows computation to proceed much faster while still giving the model a large context to make predictions at each step. ## Example: Calculating perplexity with GPT-2 in ๐Ÿค— Transformers Let's demonstrate this process with GPT-2. ```python from transformers import GPT2LMHeadModel, GPT2TokenizerFast device = "cuda" model_id = "openai-community/gpt2-large" model = GPT2LMHeadModel.from_pretrained(model_id).to(device) tokenizer = GPT2TokenizerFast.from_pretrained(model_id) ``` We'll load in the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since this dataset is small and we're just doing one forward pass over the set, we can just load and encode the entire dataset in memory. ```python from datasets import load_dataset test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test") encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt") ``` With ๐Ÿค— Transformers, we can simply pass the `input_ids` as the `labels` to our model, and the average negative log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating as context to be included in our loss, so we can set these targets to `-100` so that they are ignored. The following is an example of how we could do this with a stride of `512`. This means that the model will have at least 512 tokens for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens available to condition on). ```python import torch from tqdm import tqdm max_length = model.config.n_positions stride = 512 seq_len = encodings.input_ids.size(1) nll_sum = 0.0 n_tokens = 0 prev_end_loc = 0 for begin_loc in tqdm(range(0, seq_len, stride)): end_loc = min(begin_loc + max_length, seq_len) trg_len = end_loc - prev_end_loc # may be different from stride on last loop input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device) target_ids = input_ids.clone() target_ids[:, :-trg_len] = -100 with torch.no_grad(): outputs = model(input_ids, labels=target_ids) # loss is calculated using CrossEntropyLoss which averages over valid labels # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels # to the left by 1. neg_log_likelihood = outputs.loss # Accumulate the total negative log-likelihood and the total number of tokens num_valid_tokens = (target_ids != -100).sum().item() # number of valid tokens in target_ids batch_size = target_ids.size(0) num_loss_tokens = num_valid_tokens - batch_size # subtract batch_size due to internal label shift nll_sum += neg_log_likelihood * num_loss_tokens n_tokens += num_loss_tokens prev_end_loc = end_loc if end_loc == seq_len: break avg_nll = nll_sum / n_tokens # average negative log-likelihood per token ppl = torch.exp(avg_nll) ``` Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction, and the better the reported perplexity will typically be. When we run the above with `stride = 1024`, i.e. no overlap, the resulting PPL is `19.44`, which is about the same as the `19.93` reported in the GPT-2 paper. By using `stride = 512` and thereby employing our striding window strategy, this jumps down to `16.44`. This is not only a more favorable score, but is calculated in a way that is closer to the true autoregressive decomposition of a sequence likelihood. # Efficient Training on Multiple CPUs When training on a single CPU is too slow, we can use multiple CPUs. This guide focuses on PyTorch-based DDP enabling distributed CPU training efficiently on [bare metal](#usage-in-trainer) and [Kubernetes](#usage-with-kubernetes). ## Intelยฎ oneCCL Bindings for PyTorch [Intelยฎ oneCCL](https://github.com/oneapi-src/oneCCL) (collective communications library) is a library for efficient distributed deep learning training implementing such collectives like allreduce, allgather, alltoall. For more information on oneCCL, please refer to the [oneCCL documentation](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html) and [oneCCL specification](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html). Module `oneccl_bindings_for_pytorch` (`torch_ccl` before version 1.12) implements PyTorch C10D ProcessGroup API and can be dynamically loaded as external ProcessGroup and only works on Linux platform now Check more detailed information for [oneccl_bind_pt](https://github.com/intel/torch-ccl). ### Intelยฎ oneCCL Bindings for PyTorch installation Wheel files are available for the following Python versions: | Extension Version | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | Python 3.11 | | :---------------: | :--------: | :--------: | :--------: | :---------: | :---------: | | 2.5.0 | | โˆš | โˆš | โˆš | โˆš | | 2.4.0 | | โˆš | โˆš | โˆš | โˆš | | 2.3.0 | | โˆš | โˆš | โˆš | โˆš | | 2.2.0 | | โˆš | โˆš | โˆš | โˆš | Please run `pip list | grep torch` to get your `pytorch_version`. ```bash pip install oneccl_bind_pt=={pytorch_version} -f https://developer.intel.com/ipex-whl-stable-cpu ``` where `{pytorch_version}` should be your PyTorch version, for instance 2.4.0. Check more approaches for [oneccl_bind_pt installation](https://github.com/intel/torch-ccl). Versions of oneCCL and PyTorch must match. ## Intelยฎ MPI library Use this standards-based MPI implementation to deliver flexible, efficient, scalable cluster messaging on Intelยฎ architecture. This component is part of the Intelยฎ oneAPI HPC Toolkit. oneccl_bindings_for_pytorch is installed along with the MPI tool set. Need to source the environment before using it. ```bash oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)") source $oneccl_bindings_for_pytorch_path/env/setvars.sh ``` #### Intelยฎ Extension for PyTorch installation Intel Extension for PyTorch (IPEX) provides performance optimizations for CPU training with both Float32 and BFloat16 (refer to the [single CPU section](./perf_train_cpu) to learn more). The following "Usage in Trainer" takes mpirun in Intelยฎ MPI library as an example. ## Usage in Trainer To enable multi CPU distributed training in the Trainer with the ccl backend, users should add **`--ddp_backend ccl`** in the command arguments. Let's see an example with the [question-answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) The following command enables training with 2 processes on one Xeon node, with one process running per one socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance. ```shell script export CCL_WORKER_COUNT=1 export MASTER_ADDR=127.0.0.1 mpirun -n 2 -genv OMP_NUM_THREADS=23 \ python3 run_qa.py \ --model_name_or_path google-bert/bert-large-uncased \ --dataset_name squad \ --do_train \ --do_eval \ --per_device_train_batch_size 12 \ --learning_rate 3e-5 \ --num_train_epochs 2 \ --max_seq_length 384 \ --doc_stride 128 \ --output_dir /tmp/debug_squad/ \ --no_cuda \ --ddp_backend ccl \ --use_ipex ``` The following command enables training with a total of four processes on two Xeons (node0 and node1, taking node0 as the main process), ppn (processes per node) is set to 2, with one process running per one socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance. In node0, you need to create a configuration file which contains the IP addresses of each node (for example hostfile) and pass that configuration file path as an argument. ```shell script cat hostfile xxx.xxx.xxx.xxx #node0 ip xxx.xxx.xxx.xxx #node1 ip ``` Now, run the following command in node0 and **4DDP** will be enabled in node0 and node1 with BF16 auto mixed precision: ```shell script export CCL_WORKER_COUNT=1 export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip mpirun -f hostfile -n 4 -ppn 2 \ -genv OMP_NUM_THREADS=23 \ python3 run_qa.py \ --model_name_or_path google-bert/bert-large-uncased \ --dataset_name squad \ --do_train \ --do_eval \ --per_device_train_batch_size 12 \ --learning_rate 3e-5 \ --num_train_epochs 2 \ --max_seq_length 384 \ --doc_stride 128 \ --output_dir /tmp/debug_squad/ \ --no_cuda \ --ddp_backend ccl \ --use_ipex \ --bf16 ``` ## Usage with Kubernetes The same distributed training job from the previous section can be deployed to a Kubernetes cluster using the [Kubeflow PyTorchJob training operator](https://www.kubeflow.org/docs/components/training/user-guides/pytorch). ### Setup This example assumes that you have: * Access to a Kubernetes cluster with [Kubeflow installed](https://www.kubeflow.org/docs/started/installing-kubeflow) * [`kubectl`](https://kubernetes.io/docs/tasks/tools) installed and configured to access the Kubernetes cluster * A [Persistent Volume Claim (PVC)](https://kubernetes.io/docs/concepts/storage/persistent-volumes) that can be used to store datasets and model files. There are multiple options for setting up the PVC including using an NFS [storage class](https://kubernetes.io/docs/concepts/storage/storage-classes) or a cloud storage bucket. * A Docker container that includes your model training script and all the dependencies needed to run the script. For distributed CPU training jobs, this typically includes PyTorch, Transformers, Intel Extension for PyTorch, Intel oneCCL Bindings for PyTorch, and OpenSSH to communicate between the containers. The snippet below is an example of a Dockerfile that uses a base image that supports distributed CPU training and then extracts a Transformers release to the `/workspace` directory, so that the example scripts are included in the image: ```dockerfile FROM intel/intel-optimized-pytorch:2.4.0-pip-multinode RUN apt-get update -y && \ apt-get install -y --no-install-recommends --fix-missing \ google-perftools \ libomp-dev WORKDIR /workspace # Download and extract the transformers code ARG HF_TRANSFORMERS_VER="4.46.0" RUN pip install --no-cache-dir \ transformers==${HF_TRANSFORMERS_VER} && \ mkdir transformers && \ curl -sSL --retry 5 https://github.com/huggingface/transformers/archive/refs/tags/v${HF_TRANSFORMERS_VER}.tar.gz | tar -C transformers --strip-components=1 -xzf - ``` The image needs to be built and copied to the cluster's nodes or pushed to a container registry prior to deploying the PyTorchJob to the cluster. ### PyTorchJob Specification File The [Kubeflow PyTorchJob](https://www.kubeflow.org/docs/components/training/user-guides/pytorch) is used to run the distributed training job on the cluster. The yaml file for the PyTorchJob defines parameters such as: * The name of the PyTorchJob * The number of replicas (workers) * The python script and it's parameters that will be used to run the training job * The types of resources (node selector, memory, and CPU) needed for each worker * The image/tag for the Docker container to use * Environment variables * A volume mount for the PVC The volume mount defines a path where the PVC will be mounted in the container for each worker pod. This location can be used for the dataset, checkpoint files, and the saved model after training completes. The snippet below is an example of a yaml file for a PyTorchJob with 4 workers running the [question-answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering). ```yaml apiVersion: "kubeflow.org/v1" kind: PyTorchJob metadata: name: transformers-pytorchjob spec: elasticPolicy: rdzvBackend: c10d minReplicas: 1 maxReplicas: 4 maxRestarts: 10 pytorchReplicaSpecs: Worker: replicas: 4 # The number of worker pods restartPolicy: OnFailure template: spec: containers: - name: pytorch image: : # Specify the docker image to use for the worker pods imagePullPolicy: IfNotPresent command: ["/bin/bash", "-c"] args: - >- cd /workspace/transformers; pip install -r /workspace/transformers/examples/pytorch/question-answering/requirements.txt; source /usr/local/lib/python3.10/dist-packages/oneccl_bindings_for_pytorch/env/setvars.sh; torchrun /workspace/transformers/examples/pytorch/question-answering/run_qa.py \ --model_name_or_path distilbert/distilbert-base-uncased \ --dataset_name squad \ --do_train \ --do_eval \ --per_device_train_batch_size 12 \ --learning_rate 3e-5 \ --num_train_epochs 2 \ --max_seq_length 384 \ --doc_stride 128 \ --output_dir /tmp/pvc-mount/output_$(date +%Y%m%d_%H%M%S) \ --no_cuda \ --ddp_backend ccl \ --bf16 \ --use_ipex; env: - name: LD_PRELOAD value: "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4.5.9:/usr/local/lib/libiomp5.so" - name: TRANSFORMERS_CACHE value: "/tmp/pvc-mount/transformers_cache" - name: HF_DATASETS_CACHE value: "/tmp/pvc-mount/hf_datasets_cache" - name: LOGLEVEL value: "INFO" - name: CCL_WORKER_COUNT value: "1" - name: OMP_NUM_THREADS # Can be tuned for optimal performance value: "240" resources: limits: cpu: 240 # Update the CPU and memory limit values based on your nodes memory: 128Gi requests: cpu: 240 # Update the CPU and memory request values based on your nodes memory: 128Gi volumeMounts: - name: pvc-volume mountPath: /tmp/pvc-mount - mountPath: /dev/shm name: dshm restartPolicy: Never nodeSelector: # Optionally use nodeSelector to match a certain node label for the worker pods node-type: gnr volumes: - name: pvc-volume persistentVolumeClaim: claimName: transformers-pvc - name: dshm emptyDir: medium: Memory ``` To run this example, update the yaml based on your training script and the nodes in your cluster. The CPU resource limits/requests in the yaml are defined in [cpu units](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu) where 1 CPU unit is equivalent to 1 physical CPU core or 1 virtual core (depending on whether the node is a physical host or a VM). The amount of CPU and memory limits/requests defined in the yaml should be less than the amount of available CPU/memory capacity on a single machine. It is usually a good idea to not use the entire machine's capacity in order to leave some resources for the kubelet and OS. In order to get ["guaranteed"](https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#guaranteed) [quality of service](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod) for the worker pods, set the same CPU and memory amounts for both the resource limits and requests. ### Deploy After the PyTorchJob spec has been updated with values appropriate for your cluster and training job, it can be deployed to the cluster using: ```bash export NAMESPACE= kubectl create -f pytorchjob.yaml -n ${NAMESPACE} ``` The `kubectl get pods -n ${NAMESPACE}` command can then be used to list the pods in your namespace. You should see the worker pods for the PyTorchJob that was just deployed. At first, they will probably have a status of "Pending" as the containers get pulled and created, then the status should change to "Running". ``` NAME READY STATUS RESTARTS AGE ... transformers-pytorchjob-worker-0 1/1 Running 0 7m37s transformers-pytorchjob-worker-1 1/1 Running 0 7m37s transformers-pytorchjob-worker-2 1/1 Running 0 7m37s transformers-pytorchjob-worker-3 1/1 Running 0 7m37s ... ``` The logs for worker can be viewed using `kubectl logs -n ${NAMESPACE}`. Add `-f` to stream the logs, for example: ```bash kubectl logs transformers-pytorchjob-worker-0 -n ${NAMESPACE} -f ``` After the training job completes, the trained model can be copied from the PVC or storage location. When you are done with the job, the PyTorchJob resource can be deleted from the cluster using `kubectl delete -f pytorchjob.yaml -n ${NAMESPACE}`. ## Summary This guide covered running distributed PyTorch training jobs using multiple CPUs on bare metal and on a Kubernetes cluster. Both cases utilize Intel Extension for PyTorch and Intel oneCCL Bindings for PyTorch for optimal training performance, and can be used as a template to run your own workload on multiple nodes. # GPU inference GPUs are the standard choice of hardware for machine learning, unlike CPUs, because they are optimized for memory bandwidth and parallelism. To keep up with the larger sizes of modern models or to run these large models on existing and older hardware, there are several optimizations you can use to speed up GPU inference. In this guide, you'll learn how to use FlashAttention-2 (a more memory-efficient attention mechanism), BetterTransformer (a PyTorch native fastpath execution), and bitsandbytes to quantize your model to a lower precision. Finally, learn how to use ๐Ÿค— Optimum to accelerate inference with ONNX Runtime on Nvidia and AMD GPUs. The majority of the optimizations described here also apply to multi-GPU setups! ## FlashAttention-2 FlashAttention-2 is experimental and may change considerably in future versions. [FlashAttention-2](https://huggingface.co/papers/2205.14135) is a faster and more efficient implementation of the standard attention mechanism that can significantly speedup inference by: 1. additionally parallelizing the attention computation over sequence length 2. partitioning the work between GPU threads to reduce communication and shared memory reads/writes between them FlashAttention-2 is currently supported for the following architectures: * [Bark](https://huggingface.co/docs/transformers/model_doc/bark#transformers.BarkModel) * [Bart](https://huggingface.co/docs/transformers/model_doc/bart#transformers.BartModel) * [Chameleon](https://huggingface.co/docs/transformers/model_doc/chameleon#transformers.Chameleon) * [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPModel) * [Cohere](https://huggingface.co/docs/transformers/model_doc/cohere#transformers.CohereModel) * [GLM](https://huggingface.co/docs/transformers/model_doc/glm#transformers.GLMModel) * [Dbrx](https://huggingface.co/docs/transformers/model_doc/dbrx#transformers.DbrxModel) * [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertModel) * [Gemma](https://huggingface.co/docs/transformers/model_doc/gemma#transformers.GemmaModel) * [Gemma2](https://huggingface.co/docs/transformers/model_doc/gemma2#transformers.Gemma2Model) * [GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2) * [GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode#transformers.GPTBigCodeModel) * [GPTNeo](https://huggingface.co/docs/transformers/model_doc/gpt_neo#transformers.GPTNeoModel) * [GPTNeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox#transformers.GPTNeoXModel) * [GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj#transformers.GPTJModel) * [Granite](https://huggingface.co/docs/transformers/model_doc/granite#transformers.GraniteModel) * [GraniteMoe](https://huggingface.co/docs/transformers/model_doc/granitemoe#transformers.GraniteMoeModel) * [Idefics2](https://huggingface.co/docs/transformers/model_doc/idefics2#transformers.Idefics2Model) * [Idefics3](https://huggingface.co/docs/transformers/model_doc/idefics3#transformers.Idefics3Model) * [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel) * [JetMoe](https://huggingface.co/docs/transformers/model_doc/jetmoe#transformers.JetMoeModel) * [Jamba](https://huggingface.co/docs/transformers/model_doc/jamba#transformers.JambaModel) * [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel) * [Llava](https://huggingface.co/docs/transformers/model_doc/llava) * [Llava-NeXT](https://huggingface.co/docs/transformers/model_doc/llava_next) * [Llava-NeXT-Video](https://huggingface.co/docs/transformers/model_doc/llava_next_video) * [LLaVA-Onevision](https://huggingface.co/docs/transformers/model_doc/llava_onevision) * [Mimi](https://huggingface.co/docs/transformers/model_doc/mimi) * [VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava) * [VideoLlava](https://huggingface.co/docs/transformers/model_doc/video_llava) * [M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100) * [MBart](https://huggingface.co/docs/transformers/model_doc/mbart#transformers.MBartModel) * [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel) * [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel) * [Moshi](https://huggingface.co/docs/transformers/model_doc/moshi#transformers.MoshiModel) * [Musicgen](https://huggingface.co/docs/transformers/model_doc/musicgen#transformers.MusicgenModel) * [MusicGen Melody](https://huggingface.co/docs/transformers/model_doc/musicgen_melody#transformers.MusicgenMelodyModel) * [Nemotron](https://huggingface.co/docs/transformers/model_doc/nemotron) * [NLLB](https://huggingface.co/docs/transformers/model_doc/nllb) * [OLMo](https://huggingface.co/docs/transformers/model_doc/olmo#transformers.OlmoModel) * [OLMoE](https://huggingface.co/docs/transformers/model_doc/olmoe#transformers.OlmoeModel) * [OPT](https://huggingface.co/docs/transformers/model_doc/opt#transformers.OPTModel) * [PaliGemma](https://huggingface.co/docs/transformers/model_doc/paligemma#transformers.PaliGemmaForConditionalGeneration) * [Phi](https://huggingface.co/docs/transformers/model_doc/phi#transformers.PhiModel) * [Phi3](https://huggingface.co/docs/transformers/model_doc/phi3#transformers.Phi3Model) * [PhiMoE](https://huggingface.co/docs/transformers/model_doc/phimoe#transformers.PhimoeModel) * [StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm#transformers.StableLmModel) * [Starcoder2](https://huggingface.co/docs/transformers/model_doc/starcoder2#transformers.Starcoder2Model) * [Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2#transformers.Qwen2Model) * [Qwen2Audio](https://huggingface.co/docs/transformers/model_doc/qwen2_audio#transformers.Qwen2AudioEncoder) * [Qwen2MoE](https://huggingface.co/docs/transformers/model_doc/qwen2_moe#transformers.Qwen2MoeModel) * [Qwen2VL](https://huggingface.co/docs/transformers/model_doc/qwen2_vl#transformers.Qwen2VLModel) * [RAG](https://huggingface.co/docs/transformers/model_doc/rag#transformers.RagModel) * [SpeechEncoderDecoder](https://huggingface.co/docs/transformers/model_doc/speech_encoder_decoder#transformers.SpeechEncoderDecoderModel) * [VisionEncoderDecoder](https://huggingface.co/docs/transformers/model_doc/vision_encoder_decoder#transformers.VisionEncoderDecoderModel) * [VisionTextDualEncoder](https://huggingface.co/docs/transformers/model_doc/vision_text_dual_encoder#transformers.VisionTextDualEncoderModel) * [Whisper](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperModel) * [Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2#transformers.Wav2Vec2Model) * [Hubert](https://huggingface.co/docs/transformers/model_doc/hubert#transformers.HubertModel) * [data2vec_audio](https://huggingface.co/docs/transformers/main/en/model_doc/data2vec#transformers.Data2VecAudioModel) * [Sew](https://huggingface.co/docs/transformers/main/en/model_doc/sew#transformers.SEWModel) * [SigLIP](https://huggingface.co/docs/transformers/model_doc/siglip) * [UniSpeech](https://huggingface.co/docs/transformers/v4.39.3/en/model_doc/unispeech#transformers.UniSpeechModel) * [unispeech_sat](https://huggingface.co/docs/transformers/v4.39.3/en/model_doc/unispeech-sat#transformers.UniSpeechSatModel) You can request to add FlashAttention-2 support for another model by opening a GitHub Issue or Pull Request. Before you begin, make sure you have FlashAttention-2 installed. ```bash pip install flash-attn --no-build-isolation ``` We strongly suggest referring to the detailed [installation instructions](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features) to learn more about supported hardware and data types! FlashAttention-2 is also supported on AMD GPUs and current support is limited to **Instinct MI210**, **Instinct MI250** and **Instinct MI300**. We strongly suggest using this [Dockerfile](https://github.com/huggingface/optimum-amd/tree/main/docker/transformers-pytorch-amd-gpu-flash/Dockerfile) to use FlashAttention-2 on AMD GPUs. To enable FlashAttention-2, pass the argument `attn_implementation="flash_attention_2"` to `from_pretrained()`: ```python import torch from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM model_id = "tiiuae/falcon-7b" tokenizer = AutoTokenizer.from_pretrained(model_id) model = AutoModelForCausalLM.from_pretrained( model_id, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2", ) ``` FlashAttention-2 can only be used when the model's dtype is `fp16` or `bf16`. Make sure to cast your model to the appropriate dtype and load them on a supported device before using FlashAttention-2.
You can also set `use_flash_attention_2=True` to enable FlashAttention-2 but it is deprecated in favor of `attn_implementation="flash_attention_2"`.
FlashAttention-2 can be combined with other optimization techniques like quantization to further speedup inference. For example, you can combine FlashAttention-2 with 8-bit or 4-bit quantization: ```py import torch from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM model_id = "tiiuae/falcon-7b" tokenizer = AutoTokenizer.from_pretrained(model_id) # load in 8bit model = AutoModelForCausalLM.from_pretrained( model_id, load_in_8bit=True, attn_implementation="flash_attention_2", ) # load in 4bit model = AutoModelForCausalLM.from_pretrained( model_id, load_in_4bit=True, attn_implementation="flash_attention_2", ) ``` ### Expected speedups You can benefit from considerable speedups for inference, especially for inputs with long sequences. However, since FlashAttention-2 does not support computing attention scores with padding tokens, you must manually pad/unpad the attention scores for batched inference when the sequence contains padding tokens. This leads to a significant slowdown for batched generations with padding tokens. To overcome this, you should use FlashAttention-2 without padding tokens in the sequence during training (by packing a dataset or [concatenating sequences](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py#L516) until reaching the maximum sequence length). For a single forward pass on [tiiuae/falcon-7b](https://hf.co/tiiuae/falcon-7b) with a sequence length of 4096 and various batch sizes without padding tokens, the expected speedup is:
For a single forward pass on [meta-llama/Llama-7b-hf](https://hf.co/meta-llama/Llama-7b-hf) with a sequence length of 4096 and various batch sizes without padding tokens, the expected speedup is:
For sequences with padding tokens (generating with padding tokens), you need to unpad/pad the input sequences to correctly compute the attention scores. With a relatively small sequence length, a single forward pass creates overhead leading to a small speedup (in the example below, 30% of the input is filled with padding tokens):
But for larger sequence lengths, you can expect even more speedup benefits: FlashAttention is more memory efficient, meaning you can train on much larger sequence lengths without running into out-of-memory issues. You can potentially reduce memory usage up to 20x for larger sequence lengths. Take a look at the [flash-attention](https://github.com/Dao-AILab/flash-attention) repository for more details.
## PyTorch scaled dot product attention PyTorch's [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA) can also call FlashAttention and memory-efficient attention kernels under the hood. SDPA support is currently being added natively in Transformers and is used by default for `torch>=2.1.1` when an implementation is available. You may also set `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used. For now, Transformers supports SDPA inference and training for the following architectures: * [Albert](https://huggingface.co/docs/transformers/model_doc/albert#transformers.AlbertModel) * [Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer#transformers.ASTModel) * [Bart](https://huggingface.co/docs/transformers/model_doc/bart#transformers.BartModel) * [Bert](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertModel) * [BioGpt](https://huggingface.co/docs/transformers/model_doc/biogpt#transformers.BioGptModel) * [CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert#transformers.CamembertModel) * [Chameleon](https://huggingface.co/docs/transformers/model_doc/chameleon#transformers.Chameleon) * [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPModel) * [GLM](https://huggingface.co/docs/transformers/model_doc/glm#transformers.GLMModel) * [Cohere](https://huggingface.co/docs/transformers/model_doc/cohere#transformers.CohereModel) * [data2vec_audio](https://huggingface.co/docs/transformers/main/en/model_doc/data2vec#transformers.Data2VecAudioModel) * [Dbrx](https://huggingface.co/docs/transformers/model_doc/dbrx#transformers.DbrxModel) * [DeiT](https://huggingface.co/docs/transformers/model_doc/deit#transformers.DeiTModel) * [Dinov2](https://huggingface.co/docs/transformers/en/model_doc/dinov2) * [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertModel) * [Dpr](https://huggingface.co/docs/transformers/model_doc/dpr#transformers.DprReader) * [EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder_decoder#transformers.EncoderDecoderModel) * [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel) * [Gemma](https://huggingface.co/docs/transformers/model_doc/gemma#transformers.GemmaModel) * [Gemma2](https://huggingface.co/docs/transformers/model_doc/gemma2#transformers.Gemma2Model) * [GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2) * [GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode#transformers.GPTBigCodeModel) * [GPTNeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox#transformers.GPTNeoXModel) * [Hubert](https://huggingface.co/docs/transformers/model_doc/hubert#transformers.HubertModel) * [Idefics](https://huggingface.co/docs/transformers/model_doc/idefics#transformers.IdeficsModel) * [Idefics2](https://huggingface.co/docs/transformers/model_doc/idefics2#transformers.Idefics2Model) * [Idefics3](https://huggingface.co/docs/transformers/model_doc/idefics3#transformers.Idefics3Model) * [Granite](https://huggingface.co/docs/transformers/model_doc/granite#transformers.GraniteModel) * [GraniteMoe](https://huggingface.co/docs/transformers/model_doc/granitemoe#transformers.GraniteMoeModel) * [JetMoe](https://huggingface.co/docs/transformers/model_doc/jetmoe#transformers.JetMoeModel) * [Jamba](https://huggingface.co/docs/transformers/model_doc/jamba#transformers.JambaModel) * [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel) * [Llava](https://huggingface.co/docs/transformers/model_doc/llava) * [Llava-NeXT](https://huggingface.co/docs/transformers/model_doc/llava_next) * [Llava-NeXT-Video](https://huggingface.co/docs/transformers/model_doc/llava_next_video) * [LLaVA-Onevision](https://huggingface.co/docs/transformers/model_doc/llava_onevision) * [M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100#transformers.M2M100Model) * [Mimi](https://huggingface.co/docs/transformers/model_doc/mimi) * [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel) * [Mllama](https://huggingface.co/docs/transformers/model_doc/mllama#transformers.MllamaForConditionalGeneration) * [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel) * [Moshi](https://huggingface.co/docs/transformers/model_doc/moshi#transformers.MoshiModel) * [Musicgen](https://huggingface.co/docs/transformers/model_doc/musicgen#transformers.MusicgenModel) * [MusicGen Melody](https://huggingface.co/docs/transformers/model_doc/musicgen_melody#transformers.MusicgenMelodyModel) * [NLLB](https://huggingface.co/docs/transformers/model_doc/nllb) * [OLMo](https://huggingface.co/docs/transformers/model_doc/olmo#transformers.OlmoModel) * [OLMoE](https://huggingface.co/docs/transformers/model_doc/olmoe#transformers.OlmoeModel) * [OPT](https://huggingface.co/docs/transformers/en/model_doc/opt) * [PaliGemma](https://huggingface.co/docs/transformers/model_doc/paligemma#transformers.PaliGemmaForConditionalGeneration) * [Phi](https://huggingface.co/docs/transformers/model_doc/phi#transformers.PhiModel) * [Phi3](https://huggingface.co/docs/transformers/model_doc/phi3#transformers.Phi3Model) * [PhiMoE](https://huggingface.co/docs/transformers/model_doc/phimoe#transformers.PhimoeModel) * [Idefics](https://huggingface.co/docs/transformers/model_doc/idefics#transformers.IdeficsModel) * [Whisper](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperModel) * [mBart](https://huggingface.co/docs/transformers/model_doc/mbart#transformers.MBartModel) * [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel) * [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel) * [StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm#transformers.StableLmModel) * [Starcoder2](https://huggingface.co/docs/transformers/model_doc/starcoder2#transformers.Starcoder2Model) * [Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2#transformers.Qwen2Model) * [Qwen2Audio](https://huggingface.co/docs/transformers/model_doc/qwen2_audio#transformers.Qwen2AudioEncoder) * [Qwen2MoE](https://huggingface.co/docs/transformers/model_doc/qwen2_moe#transformers.Qwen2MoeModel) * [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaModel) * [Sew](https://huggingface.co/docs/transformers/main/en/model_doc/sew#transformers.SEWModel) * [SigLIP](https://huggingface.co/docs/transformers/model_doc/siglip) * [StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm#transformers.StableLmModel) * [Starcoder2](https://huggingface.co/docs/transformers/model_doc/starcoder2#transformers.Starcoder2Model) * [UniSpeech](https://huggingface.co/docs/transformers/v4.39.3/en/model_doc/unispeech#transformers.UniSpeechModel) * [unispeech_sat](https://huggingface.co/docs/transformers/v4.39.3/en/model_doc/unispeech-sat#transformers.UniSpeechSatModel) * [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaModel) * [Qwen2VL](https://huggingface.co/docs/transformers/model_doc/qwen2_vl#transformers.Qwen2VLModel) * [Musicgen](https://huggingface.co/docs/transformers/model_doc/musicgen#transformers.MusicgenModel) * [MusicGen Melody](https://huggingface.co/docs/transformers/model_doc/musicgen_melody#transformers.MusicgenMelodyModel) * [Nemotron](https://huggingface.co/docs/transformers/model_doc/nemotron) * [SpeechEncoderDecoder](https://huggingface.co/docs/transformers/model_doc/speech_encoder_decoder#transformers.SpeechEncoderDecoderModel) * [VideoLlava](https://huggingface.co/docs/transformers/model_doc/video_llava) * [VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava) * [VisionEncoderDecoder](https://huggingface.co/docs/transformers/model_doc/vision_encoder_decoder#transformers.VisionEncoderDecoderModel) * [ViT](https://huggingface.co/docs/transformers/model_doc/vit#transformers.ViTModel) * [ViTHybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid#transformers.ViTHybridModel) * [ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae#transformers.ViTMAEModel) * [ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn#transformers.ViTMSNModel) * [VisionTextDualEncoder](https://huggingface.co/docs/transformers/model_doc/vision_text_dual_encoder#transformers.VisionTextDualEncoderModel) * [VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae#transformers.VideoMAEModell) * [ViViT](https://huggingface.co/docs/transformers/model_doc/vivit#transformers.VivitModel) * [wav2vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2#transformers.Wav2Vec2Model) * [Whisper](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperModel) * [XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta#transformers.XLMRobertaModel) * [XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/model_doc/xlm-roberta-xl#transformers.XLMRobertaXLModel) * [YOLOS](https://huggingface.co/docs/transformers/model_doc/yolos#transformers.YolosModel) FlashAttention can only be used for models with the `fp16` or `bf16` torch type, so make sure to cast your model to the appropriate type first. The memory-efficient attention backend is able to handle `fp32` models. SDPA does not support certain sets of attention parameters, such as `head_mask` and `output_attentions=True`. In that case, you should see a warning message and we will fall back to the (slower) eager implementation. By default, SDPA selects the most performant kernel available but you can check whether a backend is available in a given setting (hardware, problem size) with [`torch.backends.cuda.sdp_kernel`](https://pytorch.org/docs/master/backends.html#torch.backends.cuda.sdp_kernel) as a context manager: ```diff import torch from transformers import AutoModelForCausalLM, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m") model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16).to("cuda") input_text = "Hello my dog is cute and" inputs = tokenizer(input_text, return_tensors="pt").to("cuda") + with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False): outputs = model.generate(**inputs) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` If you see a bug with the traceback below, try using the nightly version of PyTorch which may have broader coverage for FlashAttention: ```bash RuntimeError: No available kernel. Aborting execution. # install PyTorch nightly pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118 ``` ## BetterTransformer Some BetterTransformer features are being upstreamed to Transformers with default support for native `torch.nn.scaled_dot_product_attention`. BetterTransformer still has a wider coverage than the Transformers SDPA integration, but you can expect more and more architectures to natively support SDPA in Transformers. Check out our benchmarks with BetterTransformer and scaled dot product attention in the [Out of the box acceleration and memory savings of ๐Ÿค— decoder models with PyTorch 2.0](https://pytorch.org/blog/out-of-the-box-acceleration/) and learn more about the fastpath execution in the [BetterTransformer](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2) blog post. BetterTransformer accelerates inference with its fastpath (native PyTorch specialized implementation of Transformer functions) execution. The two optimizations in the fastpath execution are: 1. fusion, which combines multiple sequential operations into a single "kernel" to reduce the number of computation steps 2. skipping the inherent sparsity of padding tokens to avoid unnecessary computation with nested tensors BetterTransformer also converts all attention operations to use the more memory-efficient [scaled dot product attention (SDPA)](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention), and it calls optimized kernels like [FlashAttention](https://huggingface.co/papers/2205.14135) under the hood. Before you start, make sure you have ๐Ÿค— Optimum [installed](https://huggingface.co/docs/optimum/installation). Then you can enable BetterTransformer with the `PreTrainedModel.to_bettertransformer()` method: ```python model = model.to_bettertransformer() ``` You can return the original Transformers model with the `reverse_bettertransformer()` method. You should use this before saving your model to use the canonical Transformers modeling: ```py model = model.reverse_bettertransformer() model.save_pretrained("saved_model") ``` ## bitsandbytes bitsandbytes is a quantization library that includes support for 4-bit and 8-bit quantization. Quantization reduces your model size compared to its native full precision version, making it easier to fit large models onto GPUs with limited memory. Make sure you have bitsandbytes and ๐Ÿค— Accelerate installed: ```bash # these versions support 8-bit and 4-bit pip install bitsandbytes>=0.39.0 accelerate>=0.20.0 # install Transformers pip install transformers ``` ### 4-bit To load a model in 4-bit for inference, use the `load_in_4bit` parameter. The `device_map` parameter is optional, but we recommend setting it to `"auto"` to allow ๐Ÿค— Accelerate to automatically and efficiently allocate the model given the available resources in the environment. ```py from transformers import AutoModelForCausalLM model_name = "bigscience/bloom-2b5" model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True) ``` To load a model in 4-bit for inference with multiple GPUs, you can control how much GPU RAM you want to allocate to each GPU. For example, to distribute 600MB of memory to the first GPU and 1GB of memory to the second GPU: ```py max_memory_mapping = {0: "600MB", 1: "1GB"} model_name = "bigscience/bloom-3b" model_4bit = AutoModelForCausalLM.from_pretrained( model_name, device_map="auto", load_in_4bit=True, max_memory=max_memory_mapping ) ``` ### 8-bit If you're curious and interested in learning more about the concepts underlying 8-bit quantization, read the [Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes](https://huggingface.co/blog/hf-bitsandbytes-integration) blog post. To load a model in 8-bit for inference, use the `load_in_8bit` parameter. The `device_map` parameter is optional, but we recommend setting it to `"auto"` to allow ๐Ÿค— Accelerate to automatically and efficiently allocate the model given the available resources in the environment: ```py from transformers import AutoModelForCausalLM, BitsAndBytesConfig model_name = "bigscience/bloom-2b5" model_8bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True)) ``` If you're loading a model in 8-bit for text generation, you should use the `generate()` method instead of the `Pipeline` function which is not optimized for 8-bit models and will be slower. Some sampling strategies, like nucleus sampling, are also not supported by the `Pipeline` for 8-bit models. You should also place all inputs on the same device as the model: ```py from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig model_name = "bigscience/bloom-2b5" tokenizer = AutoTokenizer.from_pretrained(model_name) model_8bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True)) prompt = "Hello, my llama is cute" inputs = tokenizer(prompt, return_tensors="pt").to("cuda") generated_ids = model.generate(**inputs) outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True) ``` To load a model in 4-bit for inference with multiple GPUs, you can control how much GPU RAM you want to allocate to each GPU. For example, to distribute 1GB of memory to the first GPU and 2GB of memory to the second GPU: ```py max_memory_mapping = {0: "1GB", 1: "2GB"} model_name = "bigscience/bloom-3b" model_8bit = AutoModelForCausalLM.from_pretrained( model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping ) ``` Feel free to try running a 11 billion parameter [T5 model](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing) or the 3 billion parameter [BLOOM model](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing) for inference on Google Colab's free tier GPUs! ## ๐Ÿค— Optimum Learn more details about using ORT with ๐Ÿค— Optimum in the [Accelerated inference on NVIDIA GPUs](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#accelerated-inference-on-nvidia-gpus) and [Accelerated inference on AMD GPUs](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/amdgpu#accelerated-inference-on-amd-gpus) guides. This section only provides a brief and simple example. ONNX Runtime (ORT) is a model accelerator that supports accelerated inference on Nvidia GPUs, and AMD GPUs that use [ROCm](https://www.amd.com/en/products/software/rocm.html) stack. ORT uses optimization techniques like fusing common operations into a single node and constant folding to reduce the number of computations performed and speedup inference. ORT also places the most computationally intensive operations on the GPU and the rest on the CPU to intelligently distribute the workload between the two devices. ORT is supported by ๐Ÿค— Optimum which can be used in ๐Ÿค— Transformers. You'll need to use an [ORTModel](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort#optimum.onnxruntime.ORTModel) for the task you're solving, and specify the `provider` parameter which can be set to either [`CUDAExecutionProvider`](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#cudaexecutionprovider), [`ROCMExecutionProvider`](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/amdgpu) or [`TensorrtExecutionProvider`](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#tensorrtexecutionprovider). If you want to load a model that was not yet exported to ONNX, you can set `export=True` to convert your model on-the-fly to the ONNX format: ```py from optimum.onnxruntime import ORTModelForSequenceClassification ort_model = ORTModelForSequenceClassification.from_pretrained( "distilbert/distilbert-base-uncased-finetuned-sst-2-english", export=True, provider="CUDAExecutionProvider", ) ``` Now you're free to use the model for inference: ```py from optimum.pipelines import pipeline from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english") pipeline = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0") result = pipeline("Both the music and visual were astounding, not to mention the actors performance.") ``` ## Combine optimizations It is often possible to combine several of the optimization techniques described above to get the best inference performance possible for your model. For example, you can load a model in 4-bit, and then enable BetterTransformer with FlashAttention: ```py import torch from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig # load model in 4-bit quantization_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16 ) tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m") model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", quantization_config=quantization_config) # enable BetterTransformer model = model.to_bettertransformer() input_text = "Hello my dog is cute and" inputs = tokenizer(input_text, return_tensors="pt").to("cuda") # enable FlashAttention with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False): outputs = model.generate(**inputs) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` # Distributed training with ๐Ÿค— Accelerate As models get bigger, parallelism has emerged as a strategy for training larger models on limited hardware and accelerating training speed by several orders of magnitude. At Hugging Face, we created the [๐Ÿค— Accelerate](https://huggingface.co/docs/accelerate) library to help users easily train a ๐Ÿค— Transformers model on any type of distributed setup, whether it is multiple GPU's on one machine or multiple GPU's across several machines. In this tutorial, learn how to customize your native PyTorch training loop to enable training in a distributed environment. ## Setup Get started by installing ๐Ÿค— Accelerate: ```bash pip install accelerate ``` Then import and create an [Accelerator](https://huggingface.co/docs/accelerate/main/en/package_reference/accelerator#accelerate.Accelerator) object. The [Accelerator](https://huggingface.co/docs/accelerate/main/en/package_reference/accelerator#accelerate.Accelerator) will automatically detect your type of distributed setup and initialize all the necessary components for training. You don't need to explicitly place your model on a device. ```py >>> from accelerate import Accelerator >>> accelerator = Accelerator() ``` ## Prepare to accelerate The next step is to pass all the relevant training objects to the [prepare](https://huggingface.co/docs/accelerate/main/en/package_reference/accelerator#accelerate.Accelerator.prepare) method. This includes your training and evaluation DataLoaders, a model and an optimizer: ```py >>> train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare( ... train_dataloader, eval_dataloader, model, optimizer ... ) ``` ## Backward The last addition is to replace the typical `loss.backward()` in your training loop with ๐Ÿค— Accelerate's [backward](https://huggingface.co/docs/accelerate/main/en/package_reference/accelerator#accelerate.Accelerator.backward) method: ```py >>> for epoch in range(num_epochs): ... for batch in train_dataloader: ... outputs = model(**batch) ... loss = outputs.loss ... accelerator.backward(loss) ... optimizer.step() ... lr_scheduler.step() ... optimizer.zero_grad() ... progress_bar.update(1) ``` As you can see in the following code, you only need to add four additional lines of code to your training loop to enable distributed training! ```diff + from accelerate import Accelerator from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler + accelerator = Accelerator() model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2) optimizer = AdamW(model.parameters(), lr=3e-5) - device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") - model.to(device) + train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare( + train_dataloader, eval_dataloader, model, optimizer + ) num_epochs = 3 num_training_steps = num_epochs * len(train_dataloader) lr_scheduler = get_scheduler( "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps ) progress_bar = tqdm(range(num_training_steps)) model.train() for epoch in range(num_epochs): for batch in train_dataloader: - batch = {k: v.to(device) for k, v in batch.items()} outputs = model(**batch) loss = outputs.loss - loss.backward() + accelerator.backward(loss) optimizer.step() lr_scheduler.step() optimizer.zero_grad() progress_bar.update(1) ``` ## Train Once you've added the relevant lines of code, launch your training in a script or a notebook like Colaboratory. ### Train with a script If you are running your training from a script, run the following command to create and save a configuration file: ```bash accelerate config ``` Then launch your training with: ```bash accelerate launch train.py ``` ### Train with a notebook ๐Ÿค— Accelerate can also run in a notebook if you're planning on using Colaboratory's TPUs. Wrap all the code responsible for training in a function, and pass it to [notebook_launcher](https://huggingface.co/docs/accelerate/main/en/package_reference/launchers#accelerate.notebook_launcher): ```py >>> from accelerate import notebook_launcher >>> notebook_launcher(training_function) ``` For more information about ๐Ÿค— Accelerate and its rich features, refer to the [documentation](https://huggingface.co/docs/accelerate). # Best Practices for Generation with Cache Efficient caching is crucial for optimizing the performance of models in various generative tasks, including text generation, translation, summarization and other transformer-based applications. Effective caching helps reduce computation time and improve response rates, especially in real-time or resource-intensive applications. Transformers support various caching methods, leveraging "Cache" classes to abstract and manage the caching logic. This document outlines best practices for using these classes to maximize performance and efficiency. Check out all the available `Cache` classes in the [API documentation](./internal/generation_utils). ## What is Cache and why we should care? Imagine youโ€™re having a conversation with someone, and instead of remembering what was said previously, you have to start from scratch every time you respond. This would be slow and inefficient, right? In the world of Transformer models, a similar concept applies, and that's where Caching keys and values come into play. From now on, I'll refer to the concept as KV Cache. KV cache is needed to optimize the generation in autoregressive models, where the model predicts text token by token. This process can be slow since the model can generate only one token at a time, and each new prediction is dependent on the previous context. That means, to predict token number 1000 in the generation, you need information from the previous 999 tokens, which comes in the form of some matrix multiplications across the representations of those tokens. But to predict token number 1001, you also need the same information from the first 999 tokens, plus additional information from token number 1000. That is where key-value cache is used to optimize the sequential generation process by storing previous calculations to reuse in subsequent tokens, so they don't need to be computed again. More concretely, key-value cache acts as a memory bank for these generative models, where the model stores key-value pairs derived from self-attention layers for previously processed tokens. By storing this information, the model can avoid redundant computations and instead retrieve keys and values of previous tokens from the cache. Note that caching can be used only in inference and should be disabled when training, otherwise it might cause unexpected errors.
For the Curious Minds Who Like to Dive Deep ### Under the Hood: How Cache Object Works in Attention Mechanism When utilizing a cache object in the input, the Attention module performs several critical steps to integrate past and present information seamlessly. The Attention module concatenates the current key-values with the past key-values stored in the cache. This results in attention weights of shape `(new_tokens_length, past_kv_length + new_tokens_length)`. Essentially, the past and current key-values are combined to compute attention scores, ensuring that the model considers both previous context and new input. The concatenated key-values are used to compute the attention scores resulting in attention weights of shape `(new_tokens_length, past_kv_length + new_tokens_length)`. Therefore, when iteratively calling `forward()` instead of the `generate()` method, itโ€™s crucial to ensure that the attention mask shape matches the combined length of past and current key-values. The attention mask should have the shape `(batch_size, past_kv_length + new_tokens_length)`. This is usually handled internally when you call `generate()` method. If you want to implement your own generation loop with Cache classes, take this into consideration and prepare the attention mask to hold values to current and past tokens. One important concept you need to know when writing your own generation loop, is `cache_position`. In case you want to reuse an already filled Cache object by calling `forward()`, you have to pass in a valid `cache_position` which will indicate the positions of inputs in the sequence. Note that `cache_position` is not affected by padding, and always adds one more position for each token. For example, if key/value cache contains 10 tokens (no matter how many of it is a pad token), the cache position for the next token should be `torch.tensor([10])`. See an example below for how to implement your own generation loop. ```python >>> import torch >>> from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache >>> model_id = "meta-llama/Llama-2-7b-chat-hf" >>> model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda:0") >>> tokenizer = AutoTokenizer.from_pretrained(model_id) >>> past_key_values = DynamicCache() >>> messages = [{"role": "user", "content": "Hello, what's your name."}] >>> inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda:0") >>> generated_ids = inputs.input_ids >>> cache_position = torch.arange(inputs.input_ids.shape[1], dtype=torch.int64, device="cuda:0") >>> max_new_tokens = 10 >>> for _ in range(max_new_tokens): ... outputs = model(**inputs, cache_position=cache_position, past_key_values=past_key_values, use_cache=True) ... # Greedily sample one next token ... next_token_ids = outputs.logits[:, -1:].argmax(-1) ... generated_ids = torch.cat([generated_ids, next_token_ids], dim=-1) ... ... # Prepare inputs for the next generation step by leaaving unprocessed tokens, in our case we have only one new token ... # and expanding attn mask for the new token, as explained above ... attention_mask = inputs["attention_mask"] ... attention_mask = torch.cat([attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1) ... inputs = {"input_ids": next_token_ids, "attention_mask": attention_mask} ... cache_position = cache_position[-1:] + 1 # add one more position for the next token >>> print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]) "[INST] Hello, what's your name. [/INST] Hello! My name is LLaMA," ```
## Generate with Cache In ๐Ÿค— Transformers, we support various Cache types to optimize the performance across different models and tasks. By default, all models generate with caching, with the `~DynamicCache` class being the default cache for most models. It allows us to dynamically grow cache size, by saving more and more keys and values as we generate. If for some reason you don't want to use caches, you can pass `use_cache=False` into the `generate()` method. Refer to the table below to see the difference between cache types and choose the one that suits best for your use-case. Models for which initialization is recommended should be initialized before calling the model and passed to model as a kwarg. In all other cases you can simply define desired `cache_implementation` and we take care of the rest for you. | Cache Type | Memory Efficient | Supports torch.compile() | Initialization Recommended | Latency | Long Context Generation | |------------------------|------------------|--------------------------|----------------------------|---------|-------------------------| | Dynamic Cache | No | No | No | Mid | No | | Static Cache | No | Yes | Yes | High | No | | Offloaded Cache | Yes | No | No | Low | Yes | | Offloaded Static Cache | No | Yes | Yes | High | Yes | | Quantized Cache | Yes | No | No | Low | Yes | | Sliding Window Cache | No | Yes | Yes | High | No | | Sink Cache | Yes | No | Yes | Mid | Yes | These cache classes can be set with a `cache_implementation` argument when generating. To learn about the available options for the cache_implementation flag, please refer to the [API Documentation](./main_classes/text_generation#transformers.GenerationConfig). Now, let's explore each cache type in detail and see how to use them. Note that the below examples are for decoder-only Tranformer-based models. We also support ["Model-Specific Cache"] classes for models such as Mamba or Jamba, keep reading for more details. ### Quantized Cache The key and value cache can occupy a large portion of memory, becoming a [bottleneck for long-context generation](https://huggingface.co/blog/llama31#inference-memory-requirements), especially for Large Language Models. Quantizing the cache when using `generate()` can significantly reduce memory requirements at the cost of speed. KV Cache quantization in `transformers` is largely inspired by the paper ["KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache"](https://arxiv.org/abs/2402.02750) and currently supports `~QuantoQuantizedCache` and `~HQQQuantizedCache` classes. For more information on the inner workings see the paper. To enable quantization of the key-value cache, one needs to indicate `cache_implementation="quantized"` in the `generation_config`. Quantization related arguments should be passed to the `generation_config` either as a `dict` or an instance of a `~QuantizedCacheConfig` class. One has to indicate which quantization backend to use in the `~QuantizedCacheConfig`, the default is `quanto`. It is recommended to set `axis-key/axis-value` parameters in the cache config to `0` if you're using the `quanto` backend and to `1` if you're using the `HQQ` backend. For other config values, please use the defaults unless you're running out of memory. In that case, you may consider decreasing the residual length. Cache quantization can be detrimental in terms of latency if the context length is short and there is enough GPU VRAM available to run without cache quantization. It is recommended to seek balance between memory efficiency and latency. ```python >>> import torch >>> from transformers import AutoTokenizer, AutoModelForCausalLM >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf") >>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0") >>> inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device) >>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"nbits": 4, "backend": "quanto"}) >>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0]) I like rock music because it's loud and energetic. It's a great way to express myself and rel >>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20) >>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0]) I like rock music because it's loud and energetic. I like to listen to it when I'm feeling ``` ### Offloaded Cache Similarly to KV cache quantization, `~OffloadedCache` strategy aims to reduce GPU VRAM usage. It does so by moving the KV cache for most layers to the CPU. As the model's `forward()` method iterates over the layers, this strategy maintains the current layer cache on the GPU. At the same time it asynchronously prefetches the next layer cache as well as sending the previous layer cache back to the CPU. Unlike KV cache quantization, this strategy always produces the same result as the default KV cache implementation. Thus, it can serve as a drop-in replacement or a fallback for it. Depending on your model and the characteristics of your generation task (size of context, number of generated tokens, number of beams, etc.) you may notice a small degradation in generation throughput compared to the default KV cache implementation. To enable KV cache offloading, pass `cache_implementation="offloaded"` in the `generation_config` or directly to the `generate()` call. Use `cache_implementation="offloaded_static"` for an offloaded static cache (see also [Offloaded Static Cache](#offloaded-static-cache) below). ```python >>> import torch >>> from transformers import AutoTokenizer, AutoModelForCausalLM >>> ckpt = "microsoft/Phi-3-mini-4k-instruct" >>> tokenizer = AutoTokenizer.from_pretrained(ckpt) >>> model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0") >>> inputs = tokenizer("Fun fact: The shortest", return_tensors="pt").to(model.device) >>> out = model.generate(**inputs, do_sample=False, max_new_tokens=23, cache_implementation="offloaded") >>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0]) Fun fact: The shortest war in history was between Britain and Zanzibar on August 27, 1896. >>> out = model.generate(**inputs, do_sample=False, max_new_tokens=23) >>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0]) Fun fact: The shortest war in history was between Britain and Zanzibar on August 27, 1896. ``` Cache offloading requires a GPU and can be slower than dynamic KV cache. Use it if you are getting CUDA out of memory errors. The example below shows how KV cache offloading can be used as a fallback strategy. ```python >>> import torch >>> from transformers import AutoTokenizer, AutoModelForCausalLM >>> def resilient_generate(model, *args, **kwargs): ... oom = False ... try: ... return model.generate(*args, **kwargs) ... except torch.cuda.OutOfMemoryError as e: ... print(e) ... print("retrying with cache_implementation='offloaded'") ... oom = True ... if oom: ... torch.cuda.empty_cache() ... kwargs["cache_implementation"] = "offloaded" ... return model.generate(*args, **kwargs) ... ... >>> ckpt = "microsoft/Phi-3-mini-4k-instruct" >>> tokenizer = AutoTokenizer.from_pretrained(ckpt) >>> model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0") >>> prompt = ["okay "*1000 + "Fun fact: The most"] >>> inputs = tokenizer(prompt, return_tensors="pt").to(model.device) >>> beams = { "num_beams": 40, "num_beam_groups": 40, "num_return_sequences": 40, "diversity_penalty": 1.0, "max_new_tokens": 23, "early_stopping": True, } >>> out = resilient_generate(model, **inputs, **beams) >>> responses = tokenizer.batch_decode(out[:,-28:], skip_special_tokens=True) ``` On a GPU with 50 GB of RAM, running this code will print ``` CUDA out of memory. Tried to allocate 4.83 GiB. GPU retrying with cache_implementation='offloaded' ``` before successfully generating 40 beams. ### Static Cache Since the "DynamicCache" dynamically grows with each generation step, it prevents you from taking advantage of JIT optimizations. The `~StaticCache` pre-allocates a specific maximum size for the keys and values, allowing you to generate up to the maximum length without having to modify cache size. Check the below usage example. For more examples with Static Cache and JIT compilation, take a look at [StaticCache & torchcompile](./llm_optims#static-kv-cache-and-torchcompile) ```python >>> import torch >>> from transformers import AutoTokenizer, AutoModelForCausalLM >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf") >>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto") >>> inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device) >>> # simply pass the cache implementation="static" >>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="static") >>> tokenizer.batch_decode(out, skip_special_tokens=True)[0] "Hello, my name is [Your Name], and I am a [Your Profession] with [Number of Years] of" ``` ## Offloaded Static Cache Like `~OffloadedCache` exists for offloading a "DynamicCache", there is also an offloaded static cache. It fully supports JIT optimizations. Just pass `cache_implementation="offloaded_static"` in the `generation_config` or directly to the `generate()` call. This will use the `~OffloadedStaticCache` implementation instead. ```python >>> import torch >>> from transformers import AutoTokenizer, AutoModelForCausalLM >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf") >>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto") >>> inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device) >>> # simply pass the cache implementation="static" >>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="offloaded_static") >>> tokenizer.batch_decode(out, skip_special_tokens=True)[0] "Hello, my name is [Your Name], and I am a [Your Profession] with [Number of Years] of" ``` ### Sliding Window Cache As the name suggests, this cache type implements a sliding window over previous keys and values, retaining only the last `sliding_window` tokens. It should be used with models like Mistral that support sliding window attention. Additionally, similar to Static Cache, this one is JIT-friendly and can be used with the same compile tecniques as Static Cache. Note that you can use this cache only for models that support sliding window, e.g. Mistral models. ```python >>> import torch >>> from transformers import AutoTokenizer, AutoModelForCausalLM, SinkCache >>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1") >>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16).to("cuda:0") >>> inputs = tokenizer("Yesterday I was on a rock concert and.", return_tensors="pt").to(model.device) >>> # can be used by passing in cache implementation >>> out = model.generate(**inputs, do_sample=False, max_new_tokens=30, cache_implementation="sliding_window") >>> tokenizer.batch_decode(out, skip_special_tokens=True)[0] "Yesterday I was on a rock concert and. I was so excited to see my favorite band. I was so excited that I was jumping up and down and screaming. I was so excited that I" ``` ### Sink Cache Sink Cache was introduced in ["Efficient Streaming Language Models with Attention Sinks"](https://arxiv.org/abs/2309.17453). It allows you to generate long sequences of text ("infinite length" according to the paper) without any fine-tuning. That is achieved by smart handling of previous keys and values, specifically it retains a few initial tokens from the sequence, called "sink tokens". This is based on the observation that these initial tokens attract a significant portion of attention scores during the generation process. Tokens that come after "sink tokens" are discarded on a sliding windowed basis, keeping only the latest `window_size` tokens. By keeping these initial tokens as "attention sinks," the model maintains stable performance even when dealing with very long texts, thus discarding most of the previous knowledge. Unlike other cache classes, this one can't be used directly by indicating a `cache_implementation`. You have to initialize the Cache before calling on `generate()` as follows. ```python >>> import torch >>> from transformers import AutoTokenizer, AutoModelForCausalLM, SinkCache >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf") >>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0") >>> inputs = tokenizer("This is a long story about unicorns, fairies and magic.", return_tensors="pt").to(model.device) >>> # get our cache, specify number of sink tokens and window size >>> # Note that window size already includes sink tokens, so has to be larger >>> past_key_values = SinkCache(window_length=256, num_sink_tokens=4) >>> out = model.generate(**inputs, do_sample=False, max_new_tokens=30, past_key_values=past_key_values) >>> tokenizer.batch_decode(out, skip_special_tokens=True)[0] "This is a long story about unicorns, fairies and magic. It is a fantasy world where unicorns and fairies live together in harmony. The story follows a young girl named Lily" ``` ### Encoder-Decoder Cache The `~EncoderDecoderCache` is a wrapper designed to handle the caching needs of encoder-decoder models. This cache type is specifically built to manage both self-attention and cross-attention caches, ensuring storage and retrieval of past key/values required for these complex models. Cool thing about Encoder-Decoder Cache is that you can set different cache types for the encoder and for the decoder, depending on your use case. Currently this cache is only supported in [Whisper](./model_doc/whisper) models but we will be adding more models soon. In terms of usage, there is nothing special to be done and calling `generate()` or `forward()` will handle everything for you. ### Model-specific Cache Classes Some models require storing previous keys, values, or states in a specific way, and the above cache classes cannot be used. For such cases, we have several specialized cache classes that are designed for specific models. These models only accept their own dedicated cache classes and do not support using any other cache types. Some examples include `~HybridCache` for [Gemma2](./model_doc/gemma2) series models or `~MambaCache` for [Mamba](./model_doc/mamba) architecture models. ## Iterative Generation with Cache We have seen how to use each of the cache types when generating. What if you want to use cache in iterative generation setting, for example in applications like chatbots, where interactions involve multiple turns and continuous back-and-forth exchanges. Iterative generation with cache allows these systems to handle ongoing conversations effectively without reprocessing the entire context at each step. But there are some tips that you should know before you start implementing: The general format when doing iterative generation is as below. First you have to initialize an empty cache of the type you want, and you can start feeding in new prompts iteratively. Keeping track of dialogues history and formatting can be done with chat templates, read more on that in [chat_templating](./chat_templating) In case you are using Sink Cache, you have to crop your inputs to that maximum length because Sink Cache can generate text longer than its maximum window size, but it expects the first input to not exceed the maximum cache length. ```python >>> import torch >>> from transformers import AutoTokenizer,AutoModelForCausalLM >>> from transformers.cache_utils import ( >>> DynamicCache, >>> SinkCache, >>> StaticCache, >>> SlidingWindowCache, >>> QuantoQuantizedCache, >>> QuantizedCacheConfig, >>> ) >>> model_id = "meta-llama/Llama-2-7b-chat-hf" >>> model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map='auto') >>> tokenizer = AutoTokenizer.from_pretrained(model_id) >>> user_prompts = ["Hello, what's your name?", "Btw, yesterday I was on a rock concert."] >>> past_key_values = DynamicCache() >>> max_cache_length = past_key_values.get_max_length() >>> messages = [] >>> for prompt in user_prompts: ... messages.append({"role": "user", "content": prompt}) ... inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device) ... if isinstance(past_key_values, SinkCache): ... inputs = {k: v[:, -max_cache_length:] for k, v in inputs.items()} ... ... input_length = inputs["input_ids"].shape[1] ... ... outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256, past_key_values=past_key_values) ... completion = tokenizer.decode(outputs[0, input_length: ], skip_special_tokens=True) ... messages.append({"role": "assistant", "content": completion}) print(messages) [{'role': 'user', 'content': "Hello, what's your name?"}, {'role': 'assistant', 'content': " Hello! My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. ๐Ÿ˜Š"}, {'role': 'user', 'content': 'Btw, yesterday I was on a rock concert.'}, {'role': 'assistant', 'content': ' Oh, cool! That sounds like a lot of fun! ๐ŸŽ‰ Did you enjoy the concert? What was the band like? ๐Ÿค”'}] ``` ## Re-use Cache to continue generation Sometimes you would want to first fill-in cache object with key/values for certain prefix prompt and re-use it several times to generate different sequences from it. In that case you can construct a `Cache` object that will hold the instruction prompt, and re-use it several times with different text sequences. ```python >>> import copy >>> import torch >>> from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache, StaticCache >>> model_id = "meta-llama/Llama-2-7b-chat-hf" >>> model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda") >>> tokenizer = AutoTokenizer.from_pretrained(model_id) >>> # Init StaticCache with big enough max-length (1024 tokens for the below example) >>> # You can also init a DynamicCache, if that suits you better >>> prompt_cache = StaticCache(config=model.config, max_batch_size=1, max_cache_len=1024, device="cuda", dtype=torch.bfloat16) >>> INITIAL_PROMPT = "You are a helpful assistant. " >>> inputs_initial_prompt = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda") >>> # This is the common prompt cached, we need to run forward without grad to be abel to copy >>> with torch.no_grad(): ... prompt_cache = model(**inputs_initial_prompt, past_key_values = prompt_cache).past_key_values >>> prompts = ["Help me to write a blogpost about travelling.", "What is the capital of France?"] >>> responses = [] >>> for prompt in prompts: ... new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda") ... past_key_values = copy.deepcopy(prompt_cache) ... outputs = model.generate(**new_inputs, past_key_values=past_key_values,max_new_tokens=20) ... response = tokenizer.batch_decode(outputs)[0] ... responses.append(response) >>> print(responses) [' You are a helpful assistant. Help me to write a blogpost about travelling.\n\nTitle: The Ultimate Guide to Travelling: Tips, Tricks, and', ' You are a helpful assistant. What is the capital of France?\n\nYes, the capital of France is Paris.'] ``` ## Legacy cache format Prior to the introduction of the `Cache` object, the cache of LLMs used to be a tuple of tuples of tensors. The legacy format has a dynamic size, growing as we generate text -- very similar to `DynamicCache`. If your project depend on this legacy format, you can seamlessly convert it to a `DynamicCache` and back. ```python >>> import torch >>> from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf") >>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto") >>> inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device) >>> # `return_dict_in_generate=True` is required to return the cache. `return_legacy_cache` forces the returned cache >>> # to be of the legacy type >>> generation_outputs = model.generate(**inputs, return_dict_in_generate=True, return_legacy_cache=True, max_new_tokens=5) >>> # We can convert a legacy cache to a DynamicCache -- and the other way around. This is helpful if you have custom >>> # logic to manipulate a cache in a specific format. >>> cache = DynamicCache.from_legacy_cache(generation_outputs.past_key_values) >>> legacy_format_cache = cache.to_legacy_cache() ``` # DeepSpeed [DeepSpeed](https://www.deepspeed.ai/) is a PyTorch optimization library that makes distributed training memory-efficient and fast. At its core is the [Zero Redundancy Optimizer (ZeRO)](https://hf.co/papers/1910.02054) which enables training large models at scale. ZeRO works in several stages: * ZeRO-1, optimizer state partitioning across GPUs * ZeRO-2, gradient partitioning across GPUs * ZeRO-3, parameter partitioning across GPUs In GPU-limited environments, ZeRO also enables offloading optimizer memory and computation from the GPU to the CPU to fit and train really large models on a single GPU. DeepSpeed is integrated with the Transformers `Trainer` class for all ZeRO stages and offloading. All you need to do is provide a config file or you can use a provided template. For inference, Transformers support ZeRO-3 and offloading since it allows loading huge models. This guide will walk you through how to deploy DeepSpeed training, the features you can enable, how to setup the config files for different ZeRO stages, offloading, inference, and using DeepSpeed without the `Trainer`. ## Installation DeepSpeed is available to install from PyPI or Transformers (for more detailed installation options, take a look at the DeepSpeed [installation details](https://www.deepspeed.ai/tutorials/advanced-install/) or the GitHub [README](https://github.com/microsoft/deepspeed#installation)). If you're having difficulties installing DeepSpeed, check the [DeepSpeed CUDA installation](../debugging#deepspeed-cuda-installation) guide. While DeepSpeed has a pip installable PyPI package, it is highly recommended to [install it from source](https://www.deepspeed.ai/tutorials/advanced-install/#install-deepspeed-from-source) to best match your hardware and to support certain features, like 1-bit Adam, which arenโ€™t available in the PyPI distribution. ```bash pip install deepspeed ``` ```bash pip install transformers[deepspeed] ``` ## Memory requirements Before you begin, it is a good idea to check whether you have enough GPU and CPU memory to fit your model. DeepSpeed provides a tool for estimating the required CPU/GPU memory. For example, to estimate the memory requirements for the [bigscience/T0_3B](bigscience/T0_3B) model on a single GPU: ```bash $ python -c 'from transformers import AutoModel; \ from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \ model = AutoModel.from_pretrained("bigscience/T0_3B"); \ estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)' [...] Estimated memory needed for params, optim states and gradients for a: HW: Setup with 1 node, 1 GPU per node. SW: Model with 2783M total params, 65M largest layer params. per CPU | per GPU | Options 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0 62.23GB | 5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=1 62.23GB | 5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=0 0.37GB | 46.91GB | offload_param=none, offload_optimizer=none, zero_init=1 15.56GB | 46.91GB | offload_param=none, offload_optimizer=none, zero_init=0 ``` This means you either need a single 80GB GPU without CPU offload or a 8GB GPU and a ~60GB CPU to offload to (these are just the memory requirements for the parameters, optimizer states and gradients, and you'll need a bit more for the CUDA kernels and activations). You should also consider the tradeoff between cost and speed because it'll be cheaper to rent or buy a smaller GPU but it'll take longer to train your model. If you have enough GPU memory make sure you disable CPU/NVMe offload to make everything faster. ## Select a ZeRO stage After you've installed DeepSpeed and have a better idea of your memory requirements, the next step is selecting a ZeRO stage to use. In order of fastest and most memory-efficient: | Fastest | Memory efficient | |------------------|------------------| | ZeRO-1 | ZeRO-3 + offload | | ZeRO-2 | ZeRO-3 | | ZeRO-2 + offload | ZeRO-2 + offload | | ZeRO-3 | ZeRO-2 | | ZeRO-3 + offload | ZeRO-1 | To find what works best for you, start with the fastest approach and if you run out of memory, try the next stage which is slower but more memory efficient. Feel free to work in whichever direction you prefer (starting with the most memory efficient or fastest) to discover the appropriate balance between speed and memory usage. A general process you can use is (start with batch size of 1): 1. enable gradient checkpointing 2. try ZeRO-2 3. try ZeRO-2 and offload the optimizer 4. try ZeRO-3 5. try ZeRO-3 and offload parameters to the CPU 6. try ZeRO-3 and offload parameters and the optimizer to the CPU 7. try lowering various default values like a narrower search beam if you're using the `generate()` method 8. try mixed half-precision (fp16 on older GPU architectures and bf16 on Ampere) over full-precision weights 9. add more hardware if possible or enable Infinity to offload parameters and the optimizer to a NVMe 10. once you're not running out of memory, measure effective throughput and then try to increase the batch size as large as you can to maximize GPU efficiency 11. lastly, try to optimize your training setup by disabling some offload features or use a faster ZeRO stage and increasing/decreasing the batch size to find the best tradeoff between speed and memory usage ## DeepSpeed configuration file DeepSpeed works with the `Trainer` class by way of a config file containing all the parameters for configuring how you want setup your training run. When you execute your training script, DeepSpeed logs the configuration it received from `Trainer` to the console so you can see exactly what configuration was used. Find a complete list of DeepSpeed configuration options on the [DeepSpeed Configuration JSON](https://www.deepspeed.ai/docs/config-json/) reference. You can also find more practical examples of various DeepSpeed configuration examples on the [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples) repository or the main [DeepSpeed](https://github.com/microsoft/DeepSpeed) repository. To quickly find specific examples, you can: ```bash git clone https://github.com/microsoft/DeepSpeedExamples cd DeepSpeedExamples find . -name '*json' # find examples with the Lamb optimizer grep -i Lamb $(find . -name '*json') ``` The DeepSpeed configuration file is passed as a path to a JSON file if you're training from the command line interface or as a nested `dict` object if you're using the `Trainer` in a notebook setting. ```py TrainingArguments(..., deepspeed="path/to/deepspeed_config.json") ``` ```py ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params) args = TrainingArguments(..., deepspeed=ds_config_dict) trainer = Trainer(model, args, ...) ``` ### DeepSpeed and Trainer parameters There are three types of configuration parameters: 1. Some of the configuration parameters are shared by `Trainer` and DeepSpeed, and it can be difficult to identify errors when there are conflicting definitions. To make it easier, these shared configuration parameters are configured from the `Trainer` command line arguments. 2. Some configuration parameters that are automatically derived from the model configuration so you don't need to manually adjust these values. The `Trainer` uses a configuration value `auto` to determine set the most correct or efficient value. You could set your own configuration parameters explicitly, but you must take care to ensure the `Trainer` arguments and DeepSpeed configuration parameters agree. Mismatches may cause the training to fail in very difficult to detect ways! 3. Some configuration parameters specific to DeepSpeed only which need to be manually set based on your training needs. You could also modify the DeepSpeed configuration and edit `TrainingArguments` from it: 1. Create or load a DeepSpeed configuration to use as the main configuration 2. Create a `TrainingArguments` object based on these DeepSpeed configuration values Some values, such as `scheduler.params.total_num_steps` are calculated by the `Trainer` during training. ### ZeRO configuration There are three configurations, each corresponding to a different ZeRO stage. Stage 1 is not as interesting for scalability, and this guide focuses on stages 2 and 3. The `zero_optimization` configuration contains all the options for what to enable and how to configure them. For a more detailed explanation of each parameter, take a look at the [DeepSpeed Configuration JSON](https://www.deepspeed.ai/docs/config-json/) reference. DeepSpeed doesnโ€™t validate parameter names and any typos fallback on the parameter's default setting. You can watch the DeepSpeed engine startup log messages to see what values it is going to use. The following configurations must be setup with DeepSpeed because the `Trainer` doesn't provide equivalent command line arguments. ZeRO-1 shards the optimizer states across GPUs, and you can expect a tiny speed up. The ZeRO-1 config can be setup like this: ```yml { "zero_optimization": { "stage": 1 } } ``` ZeRO-2 shards the optimizer and gradients across GPUs. This stage is primarily used for training since its features are not relevant to inference. Some important parameters to configure for better performance include: * `offload_optimizer` should be enabled to reduce GPU memory usage. * `overlap_comm` when set to `true` trades off increased GPU memory usage to lower allreduce latency. This feature uses 4.5x the `allgather_bucket_size` and `reduce_bucket_size` values. In this example, they're set to `5e8` which means it requires 9GB of GPU memory. If your GPU memory is 8GB or less, you should reduce `overlap_comm` to lower the memory requirements and prevent an out-of-memory (OOM) error. * `allgather_bucket_size` and `reduce_bucket_size` trade off available GPU memory for communication speed. The smaller their values, the slower communication is and the more GPU memory is available. You can balance, for example, whether a bigger batch size is more important than a slightly slower training time. * `round_robin_gradients` is available in DeepSpeed 0.4.4 for CPU offloading. It parallelizes gradient copying to CPU memory among ranks by fine-grained gradient partitioning. Performance benefit grows with gradient accumulation steps (more copying between optimizer steps) or GPU count (increased parallelism). ```yml { "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 5e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 5e8, "contiguous_gradients": true "round_robin_gradients": true } } ``` ZeRO-3 shards the optimizer, gradient, and parameters across GPUs. Unlike ZeRO-2, ZeRO-3 can also be used for inference, in addition to training, because it allows large models to be loaded on multiple GPUs. Some important parameters to configure include: * `device: "cpu"` can help if you're running out of GPU memory and if you have free CPU memory available. This allows offloading model parameters to the CPU. * `pin_memory: true` can improve throughput, but less memory becomes available for other processes because the pinned memory is reserved for the specific process that requested it and it's typically accessed much faster than normal CPU memory. * `stage3_max_live_parameters` is the upper limit on how many full parameters you want to keep on the GPU at any given time. Reduce this value if you encounter an OOM error. * `stage3_max_reuse_distance` is a value for determining when a parameter is used again in the future, and it helps decide whether to throw the parameter away or to keep it. If the parameter is going to be reused (if the value is less than `stage3_max_reuse_distance`), then it is kept to reduce communication overhead. This is super helpful when activation checkpointing is enabled and you want to keep the parameter in the forward recompute until the backward pass. But reduce this value if you encounter an OOM error. * `stage3_gather_16bit_weights_on_model_save` consolidates fp16 weights when a model is saved. For large models and multiple GPUs, this is expensive in terms of memory and speed. You should enable it if you're planning on resuming training. * `sub_group_size` controls which parameters are updated during the optimizer step. Parameters are grouped into buckets of `sub_group_size` and each bucket is updated one at a time. When used with NVMe offload, `sub_group_size` determines when model states are moved in and out of CPU memory from during the optimization step. This prevents running out of CPU memory for extremely large models. `sub_group_size` can be left to its default value if you aren't using NVMe offload, but you may want to change it if you: 1. Run into an OOM error during the optimizer step. In this case, reduce `sub_group_size` to reduce memory usage of the temporary buffers. 2. The optimizer step is taking a really long time. In this case, increase `sub_group_size` to improve bandwidth utilization as a result of increased data buffers. * `reduce_bucket_size`, `stage3_prefetch_bucket_size`, and `stage3_param_persistence_threshold` are dependent on a model's hidden size. It is recommended to set these values to `auto` and allow the `Trainer` to automatically assign the values. ```yml { "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true } } ``` You can use the [`deepspeed.zero.Init`](https://deepspeed.readthedocs.io/en/latest/zero3.html#deepspeed.zero.Init) context manager to initialize a model faster: ```py from transformers import T5ForConditionalGeneration, T5Config import deepspeed with deepspeed.zero.Init(): config = T5Config.from_pretrained("google-t5/t5-small") model = T5ForConditionalGeneration(config) ``` For pretrained models, the DeepSped config file needs to have `is_deepspeed_zero3_enabled: true` setup in `TrainingArguments` and it needs a ZeRO configuration enabled. The `TrainingArguments` object must be created **before** calling the model `from_pretrained()`. ```py from transformers import AutoModel, Trainer, TrainingArguments training_args = TrainingArguments(..., deepspeed=ds_config) model = AutoModel.from_pretrained("google-t5/t5-small") trainer = Trainer(model=model, args=training_args, ...) ``` You'll need ZeRO-3 if the fp16 weights don't fit on a single GPU. If you're able to load fp16 weights, then make sure you specify `torch_dtype=torch.float16` in `from_pretrained()`. Another consideration for ZeRO-3 is if you have multiple GPUs, no single GPU has all the parameters unless it's the parameters for the currently executing layer. To access all parameters from all the layers at once, such as loading pretrained model weights in `from_pretrained()`, one layer is loaded at a time and immediately partitioned to all GPUs. This is because for very large models, it isn't possible to load the weights on one GPU and then distribute them across the other GPUs due to memory limitations. If you encounter a model parameter weight that looks like the following, where `tensor([1.])` or the parameter size is 1 instead of a larger multi-dimensional shape, this means the parameter is partitioned and this is a ZeRO-3 placeholder. ```py tensor([1.0], device="cuda:0", dtype=torch.float16, requires_grad=True) ``` For more information about initializing large models with ZeRO-3 and accessing the parameters, take a look at the [Constructing Massive Models](https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models) and [Gathering Parameters](https://deepspeed.readthedocs.io/en/latest/zero3.html#gathering-parameters) guides. ### NVMe configuration [ZeRO-Infinity](https://hf.co/papers/2104.07857) allows offloading model states to the CPU and/or NVMe to save even more memory. Smart partitioning and tiling algorithms allow each GPU to send and receive very small amounts of data during offloading such that a modern NVMe can fit an even larger total memory pool than is available to your training process. ZeRO-Infinity requires ZeRO-3. Depending on the CPU and/or NVMe memory available, you can offload both the [optimizer states](https://www.deepspeed.ai/docs/config-json/#optimizer-offloading) and [parameters](https://www.deepspeed.ai/docs/config-json/#parameter-offloading), just one of them, or none. You should also make sure the `nvme_path` is pointing to an NVMe device, because while it still works with a normal hard drive or solid state drive, it'll be significantly slower. With a modern NVMe, you can expect peak transfer speeds of ~3.5GB/s for read and ~3GB/s for write operations. Lastly, [run a benchmark](https://github.com/microsoft/DeepSpeed/issues/998) on your training setup to determine the optimal `aio` configuration. The example ZeRO-3/Infinity configuration file below sets most of the parameter values to `auto`, but you could also manually add these values. ```yml { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "nvme", "nvme_path": "/local_nvme", "pin_memory": true, "buffer_count": 4, "fast_init": false }, "offload_param": { "device": "nvme", "nvme_path": "/local_nvme", "pin_memory": true, "buffer_count": 5, "buffer_size": 1e8, "max_in_cpu": 1e9 }, "aio": { "block_size": 262144, "queue_depth": 32, "thread_count": 1, "single_submit": false, "overlap_events": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false } ``` ## DeepSpeed features There are a number of important parameters to specify in the DeepSpeed configuration file which are briefly described in this section. ### Activation/gradient checkpointing Activation and gradient checkpointing trades speed for more GPU memory which allows you to overcome scenarios where your GPU is out of memory or to increase your batch size for better performance. To enable this feature: 1. For a Hugging Face model, set `model.gradient_checkpointing_enable()` or `--gradient_checkpointing` in the `Trainer`. 2. For a non-Hugging Face model, use the DeepSpeed [Activation Checkpointing API](https://deepspeed.readthedocs.io/en/latest/activation-checkpointing.html). You could also replace the Transformers modeling code and replace `torch.utils.checkpoint` with the DeepSpeed API. This approach is more flexible because you can offload the forward activations to the CPU memory instead of recalculating them. ### Optimizer and scheduler DeepSpeed and Transformers optimizer and scheduler can be mixed and matched as long as you don't enable `offload_optimizer`. When `offload_optimizer` is enabled, you could use a non-DeepSpeed optimizer (except for LAMB) as long as it has both a CPU and GPU implementation. The optimizer and scheduler parameters for the config file can be set from the command line to avoid hard to find errors. For example, if the learning rate is set to a different value in another place you can override it from the command line. Aside from the optimizer and scheduler parameters, you'll need to ensure your `Trainer` command line arguments match the DeepSpeed configuration. DeepSpeed offers several [optimizers](https://www.deepspeed.ai/docs/config-json/#optimizer-parameters) (Adam, AdamW, OneBitAdam, and LAMB) but you can also import other optimizers from PyTorch. If you don't configure the optimizer in the config, the `Trainer` automatically selects AdamW and either uses the supplied values or the default values for the following parameters from the command line: `lr`, `adam_beta1`, `adam_beta2`, `adam_epsilon`, `weight_decay`. You can set the parameters to `"auto"` or manually input your own desired values. ```yaml { "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } } } ``` You can also use an unsupported optimizer by adding the following to the top level configuration. ```yaml { "zero_allow_untested_optimizer": true } ``` From DeepSpeed==0.8.3 on, if you want to use offload, you'll also need to the following to the top level configuration because offload works best with DeepSpeed's CPU Adam optimizer. ```yaml { "zero_force_ds_cpu_optimizer": false } ``` DeepSpeed supports the LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR learning rate [schedulers](https://www.deepspeed.ai/docs/config-json/#scheduler-parameters). Transformers and DeepSpeed provide two of the same schedulers: * WarmupLR is the same as `--lr_scheduler_type constant_with_warmup` in Transformers * WarmupDecayLR is the same as `--lr_scheduler_type linear` in Transformers (this is the default scheduler used in Transformers) If you don't configure the scheduler in the config, the `Trainer` automatically selects WarmupDecayLR and either uses the supplied values or the default values for the following parameters from the command line: `warmup_min_lr`, `warmup_max_lr`, `warmup_num_steps`, `total_num_steps` (automatically calculated during run time if `max_steps` is not provided). You can set the parameters to `"auto"` or manually input your own desired values. ```yaml { "scheduler": { "type": "WarmupDecayLR", "params": { "total_num_steps": "auto", "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } } } ``` ### Precision Deepspeed supports fp32, fp16, and bf16 mixed precision. If your model doesn't work well with mixed precision, for example if it wasn't pretrained in mixed precision, you may encounter overflow or underflow issues which can cause NaN loss. For these cases, you should use full fp32 precision by explicitly disabling the default fp16 mode. ```yaml { "fp16": { "enabled": false } } ``` For Ampere GPUs and PyTorch > 1.7, it automatically switches to the more efficient [tf32](https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices) format for some operations but the results are still in fp32. You can control it from the `Trainer` by setting `--tf32` to enable it, and `--tf32 0` or `--no_tf32` to disable it. To configure PyTorch AMP-like fp16 mixed precision reduces memory usage and accelerates training speed. `Trainer` automatically enables or disables fp16 based on the value of `args.fp16_backend`, and the rest of the config can be set by you. fp16 is enabled from the command line when the following arguments are passed: `--fp16`, `--fp16_backend amp` or `--fp16_full_eval`. ```yaml { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 } } ``` For additional DeepSpeed fp16 training options, take a look at the [FP16 Training Options](https://www.deepspeed.ai/docs/config-json/#fp16-training-options) reference. To configure Apex-like fp16 mixed precision, setup the config as shown below with `"auto"` or your own values. `Trainer` automatically configure `amp` based on the values of `args.fp16_backend` and `args.fp16_opt_level`. It can also be enabled from the command line when the following arguments are passed: `--fp16`, `--fp16_backend apex` or `--fp16_opt_level 01`. ```yaml { "amp": { "enabled": "auto", "opt_level": "auto" } } ``` To use bf16, you'll need at least DeepSpeed==0.6.0. bf16 has the same dynamic range as fp32 and doesnโ€™t require loss scaling. However, if you use [gradient accumulation](#gradient-accumulation) with bf16, gradients are accumulated in bf16 which may not be desired because this format's low precision can lead to lossy accumulation. bf16 can be setup in the config file or enabled from the command line when the following arguments are passed: `--bf16` or `--bf16_full_eval`. ```yaml { "bf16": { "enabled": "auto" } } ``` ### Batch size The batch size can be auto-configured or explicitly set. If you choose to use the `"auto"` option, `Trainer` sets `train_micro_batch_size_per_gpu` to the value of args.`per_device_train_batch_size` and `train_batch_size` to `args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps`. ```yaml { "train_micro_batch_size_per_gpu": "auto", "train_batch_size": "auto" } ``` ### Gradient accumulation Gradient accumulation can be auto-configured or explicitly set. If you choose to use the `"auto"` option, `Trainer` sets it to the value of `args.gradient_accumulation_steps`. ```yaml { "gradient_accumulation_steps": "auto" } ``` ### Gradient clipping Gradient clipping can be auto-configured or explicitly set. If you choose to use the `"auto"` option, `Trainer` sets it to the value of `args.max_grad_norm`. ```yaml { "gradient_clipping": "auto" } ``` ### Communication data type For communication collectives like reduction, gathering and scattering operations, a separate data type is used. All gather and scatter operations are performed in the same data type the data is in. For example, if you're training with bf16, the data is also gathered in bf16 because gathering is a non-lossy operation. Reduce operations are lossy, for example when gradients are averaged across multiple GPUs. When the communication is done in fp16 or bf16, it is more likely to be lossy because adding multiple numbers in low precision isn't exact. This is especially the case with bf16 which has a lower precision than fp16. For this reason, fp16 is the default for reduction operations because the loss is minimal when averaging gradients. You can choose the communication data type by setting the `communication_data_type` parameter in the config file. For example, choosing fp32 adds a small amount of overhead but ensures the reduction operation is accumulated in fp32 and when it is ready, it is downcasted to whichever half-precision dtype you're training in. ```yaml { "communication_data_type": "fp32" } ``` ## Deployment DeepSpeed can be deployed by different launchers such as [torchrun](https://pytorch.org/docs/stable/elastic/run.html), the `deepspeed` launcher, or [Accelerate](https://huggingface.co/docs/accelerate/basic_tutorials/launch#using-accelerate-launch). To deploy, add `--deepspeed ds_config.json` to the `Trainer` command line. Itโ€™s recommended to use DeepSpeedโ€™s [`add_config_arguments`](https://deepspeed.readthedocs.io/en/latest/initialize.html#argument-parsing) utility to add any necessary command line arguments to your code. This guide will show you how to deploy DeepSpeed with the `deepspeed` launcher for different training setups. You can check out this [post](https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400) for more practical usage examples. To deploy DeepSpeed on multiple GPUs, add the `--num_gpus` parameter. If you want to use all available GPUs, you don't need to add `--num_gpus`. The example below uses 2 GPUs. ```bash deepspeed --num_gpus=2 examples/pytorch/translation/run_translation.py \ --deepspeed tests/deepspeed/ds_config_zero3.json \ --model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \ --output_dir output_dir --overwrite_output_dir --fp16 \ --do_train --max_train_samples 500 --num_train_epochs 1 \ --dataset_name wmt16 --dataset_config "ro-en" \ --source_lang en --target_lang ro ``` To deploy DeepSpeed on a single GPU, add the `--num_gpus` parameter. It isn't necessary to explicitly set this value if you only have 1 GPU because DeepSpeed deploys all GPUs it can see on a given node. ```bash deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \ --deepspeed tests/deepspeed/ds_config_zero2.json \ --model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \ --output_dir output_dir --overwrite_output_dir --fp16 \ --do_train --max_train_samples 500 --num_train_epochs 1 \ --dataset_name wmt16 --dataset_config "ro-en" \ --source_lang en --target_lang ro ``` DeepSpeed is still useful with just 1 GPU because you can: 1. Offload some computations and memory to the CPU to make more GPU resources available to your model to use a larger batch size or fit a very large model that normally won't fit. 2. Minimize memory fragmentation with it's smart GPU memory management system which also allows you to fit bigger models and data batches. Set the `allgather_bucket_size` and `reduce_bucket_size` values to 2e8 in the [ZeRO-2](#zero-configuration) configuration file to get better performance on a single GPU. ### Multi-node deployment A node is one or more GPUs for running a workload. A more powerful setup is a multi-node setup which can be launched with the `deepspeed` launcher. For this guide, let's assume there are two nodes with 8 GPUs each. The first node can be accessed `ssh hostname1` and the second node with `ssh hostname2`. Both nodes must be able to communicate with each other locally over ssh without a password. By default, DeepSpeed expects your multi-node environment to use a shared storage. If this is not the case and each node can only see the local filesystem, you need to adjust the config file to include a [`checkpoint`](https://www.deepspeed.ai/docs/config-json/#checkpoint-options) to allow loading without access to a shared filesystem: ```yaml { "checkpoint": { "use_node_local_storage": true } } ``` You could also use the `Trainer`'s `--save_on_each_node` argument to automatically add the above `checkpoint` to your config. For [torchrun](https://pytorch.org/docs/stable/elastic/run.html), you have to ssh to each node and run the following command on both of them. The launcher waits until both nodes are synchronized before launching the training. ```bash torchrun --nproc_per_node=8 --nnode=2 --node_rank=0 --master_addr=hostname1 \ --master_port=9901 your_program.py --deepspeed ds_config.json ``` For the `deepspeed` launcher, start by creating a `hostfile`. ```bash hostname1 slots=8 hostname2 slots=8 ``` Then you can launch the training with the following command. The `deepspeed` launcher automatically launches the command on both nodes at once. ```bash deepspeed --num_gpus 8 --num_nodes 2 --hostfile hostfile --master_addr hostname1 --master_port=9901 \ your_program.py --deepspeed ds_config.json ``` Check out the [Resource Configuration (multi-node)](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node) guide for more details about configuring multi-node compute resources. ### SLURM In a SLURM environment, you'll need to adapt your SLURM script to your specific SLURM environment. An example SLURM script may look like: ```bash #SBATCH --job-name=test-nodes # name #SBATCH --nodes=2 # nodes #SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node! #SBATCH --cpus-per-task=10 # number of cores per tasks #SBATCH --gres=gpu:8 # number of gpus #SBATCH --time 20:00:00 # maximum execution time (HH:MM:SS) #SBATCH --output=%x-%j.out # output file name export GPUS_PER_NODE=8 export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1) export MASTER_PORT=9901 srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \ --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \ --master_addr $MASTER_ADDR --master_port $MASTER_PORT \ your_program.py --deepspeed ds_config.json' ``` Then you can schedule your multi-node deployment with the following command which launches training simultaneously on all nodes. ```bash sbatch launch.slurm ``` ### Notebook The `deepspeed` launcher doesn't support deployment from a notebook so you'll need to emulate the distributed environment. However, this only works for 1 GPU. If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. This means you have to use the `deepspeed` launcher which can't be emulated as shown here. ```py # DeepSpeed requires a distributed environment even when only one process is used. # This emulates a launcher in the notebook import os os.environ["MASTER_ADDR"] = "localhost" os.environ["MASTER_PORT"] = "9994" # modify if RuntimeError: Address already in use os.environ["RANK"] = "0" os.environ["LOCAL_RANK"] = "0" os.environ["WORLD_SIZE"] = "1" # Now proceed as normal, plus pass the DeepSpeed config file training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json") trainer = Trainer(...) trainer.train() ``` If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated cell. ```py %%bash cat <<'EOT' > ds_config_zero3.json { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false } EOT ``` If the training script is in a file and not in a notebook cell, you can launch `deepspeed` normally from the shell in a notebook cell. For example, to launch `run_translation.py`: ```py !git clone https://github.com/huggingface/transformers !cd transformers; deepspeed examples/pytorch/translation/run_translation.py ... ``` You could also use `%%bash` magic and write multi-line code to run the shell program, but you won't be able to view the logs until training is complete. With `%%bash` magic, you don't need to emulate a distributed environment. ```py %%bash git clone https://github.com/huggingface/transformers cd transformers deepspeed examples/pytorch/translation/run_translation.py ... ``` ## Save model weights DeepSpeed stores the main full precision fp32 weights in custom checkpoint optimizer files (the glob pattern looks like `global_step*/*optim_states.pt`) and are saved under the normal checkpoint. A model trained with ZeRO-2 saves the pytorch_model.bin weights in fp16. To save the model weights in fp16 for a model trained with ZeRO-3, you need to set `"stage3_gather_16bit_weights_on_model_save": true` because the model weights are partitioned across multiple GPUs. Otherwise, the `Trainer` won't save the weights in fp16 and it won't create a pytorch_model.bin file. This is because DeepSpeed's state_dict contains a placeholder instead of the real weights and you won't be able to load them. ```yaml { "zero_optimization": { "stage3_gather_16bit_weights_on_model_save": true } } ``` The full precision weights shouldn't be saved during training because it can require a lot of memory. It is usually best to save the fp32 weights offline after training is complete. But if you have a lot of free CPU memory, it is possible to save the fp32 weights during training. This section covers both online and offline approaches. ### Online You must have saved at least one checkpoint to load the latest checkpoint as shown in the following: ```py from transformers.trainer_utils import get_last_checkpoint from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint checkpoint_dir = get_last_checkpoint(trainer.args.output_dir) fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) ``` If you've enabled the `--load_best_model_at_end` parameter to track the best checkpoint in `TrainingArguments`, you can finish training first and save the final model explicitly. Then you can reload it as shown below: ```py from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final") trainer.deepspeed.save_checkpoint(checkpoint_dir) fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) ``` Once `load_state_dict_from_zero_checkpoint` is run, the model is no longer usable in DeepSpeed in the context of the same application. You'll need to initialize the DeepSpeed engine again since `model.load_state_dict(state_dict)` removes all the DeepSpeed magic from it. Only use this at the very end of training. You can also extract and load the state_dict of the fp32 weights: ```py from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu model = model.cpu() model.load_state_dict(state_dict) ``` ### Offline DeepSpeed provides a zero_to_fp32.py script at the top-level of the checkpoint folder for extracting weights at any point. This is a standalone script and you don't need a configuration file or `Trainer`. For example, if your checkpoint folder looked like this: ```bash $ ls -l output_dir/checkpoint-1/ -rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/ -rw-rw-r-- 1 stas stas 12 Mar 27 13:16 latest -rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt -rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin -rw-rw-r-- 1 stas stas 623 Mar 27 20:42 scheduler.pt -rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json -rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model -rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json -rw-rw-r-- 1 stas stas 339 Mar 27 20:42 trainer_state.json -rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin -rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py* ``` To reconstruct the fp32 weights from the DeepSpeed checkpoint (ZeRO-2 or ZeRO-3) subfolder `global_step1`, run the following command to create and consolidate the full fp32 weights from multiple GPUs into a single pytorch_model.bin file. The script automatically discovers the subfolder containing the checkpoint. ```py python zero_to_fp32.py . pytorch_model.bin ``` Run `python zero_to_fp32.py -h` for more usage details. The script requires 2x the general RAM of the final fp32 weights. ## ZeRO Inference [ZeRO Inference](https://www.deepspeed.ai/2022/09/09/zero-inference.html) places the model weights in CPU or NVMe memory to avoid burdening the GPU which makes it possible to run inference with huge models on a GPU. Inference doesn't require any large additional amounts of memory for the optimizer states and gradients so you can fit much larger batches and/or sequence lengths on the same hardware. ZeRO Inference shares the same configuration file as [ZeRO-3](#zero-configuration), and ZeRO-2 and ZeRO-1 configs won't work because they don't provide any benefits for inference. To run ZeRO Inference, pass your usual training arguments to the `TrainingArguments` class and add the `--do_eval` argument. ```bash deepspeed --num_gpus=2 your_program.py --do_eval --deepspeed ds_config.json ``` ## Non-Trainer DeepSpeed integration DeepSpeed also works with Transformers without the `Trainer` class. This is handled by the `HfDeepSpeedConfig` which only takes care of gathering ZeRO-3 parameters and splitting a model across multiple GPUs when you call `from_pretrained()`. If you want everything automatically taken care of for you, try using DeepSpeed with the `Trainer`! You'll need to follow the [DeepSpeed documentation](https://www.deepspeed.ai/), and manually configure the parameter values in the config file (you can't use the `"auto"` value). To efficiently deploy ZeRO-3, you must instantiate the `HfDeepSpeedConfig` object before the model and keep that object alive: ```py from transformers.integrations import HfDeepSpeedConfig from transformers import AutoModel import deepspeed ds_config = {...} # deepspeed config object or path to the file # must run before instantiating the model to detect zero 3 dschf = HfDeepSpeedConfig(ds_config) # keep this object alive model = AutoModel.from_pretrained("openai-community/gpt2") engine = deepspeed.initialize(model=model, config_params=ds_config, ...) ``` `HfDeepSpeedConfig` is not required for ZeRO-1 or ZeRO-2. ```py from transformers.integrations import HfDeepSpeedConfig from transformers import AutoModel, AutoConfig import deepspeed ds_config = {...} # deepspeed config object or path to the file # must run before instantiating the model to detect zero 3 dschf = HfDeepSpeedConfig(ds_config) # keep this object alive config = AutoConfig.from_pretrained("openai-community/gpt2") model = AutoModel.from_config(config) engine = deepspeed.initialize(model=model, config_params=ds_config, ...) ``` ### Non-Trainer ZeRO Inference To run ZeRO Inference without the `Trainer` in cases where you canโ€™t fit a model onto a single GPU, try using additional GPUs or/and offloading to CPU memory. The important nuance to understand here is that the way ZeRO is designed, you can process different inputs on different GPUs in parallel. Make sure to: * disable CPU offload if you have enough GPU memory (since it slows things down). * enable bf16 if you have an Ampere or newer GPU to make things faster. If you donโ€™t have one of these GPUs, you may enable fp16 as long as you donโ€™t use a model pretrained in bf16 (T5 models) because it may lead to an overflow error. Take a look at the following script to get a better idea of how to run ZeRO Inference without the `Trainer` on a model that won't fit on a single GPU. ```py #!/usr/bin/env python # This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model # into a single GPU # # 1. Use 1 GPU with CPU offload # 2. Or use multiple GPUs instead # # First you need to install deepspeed: pip install deepspeed # # Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2 # small GPUs can handle it. or 1 small GPU and a lot of CPU memory. # # To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU - # you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to # process multiple inputs at once. # # The provided deepspeed config also activates CPU memory offloading, so chances are that if you # have a lot of available CPU memory and you don't mind a slowdown you should be able to load a # model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will # run faster if you don't want offload to CPU - so disable that section then. # # To deploy on 1 gpu: # # deepspeed --num_gpus 1 t0.py # or: # python -m torch.distributed.run --nproc_per_node=1 t0.py # # To deploy on 2 gpus: # # deepspeed --num_gpus 2 t0.py # or: # python -m torch.distributed.run --nproc_per_node=2 t0.py from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM from transformers.integrations import HfDeepSpeedConfig import deepspeed import os import torch os.environ["TOKENIZERS_PARALLELISM"] = "false" # To avoid warnings about parallelism in tokenizers # distributed setup local_rank = int(os.getenv("LOCAL_RANK", "0")) world_size = int(os.getenv("WORLD_SIZE", "1")) torch.cuda.set_device(local_rank) deepspeed.init_distributed() model_name = "bigscience/T0_3B" config = AutoConfig.from_pretrained(model_name) model_hidden_size = config.d_model # batch size has to be divisible by world_size, but can be bigger than world_size train_batch_size = 1 * world_size # ds_config notes # # - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be # faster. # # - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g. # all official t5 models are bf16-pretrained # # - set offload_param.device to "none" or completely remove the `offload_param` section if you don't # - want CPU offload # # - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control # - which params should remain on gpus - the larger the value the smaller the offload size # # For in-depth info on Deepspeed config see # https://huggingface.co/docs/transformers/main/main_classes/deepspeed # keeping the same format as json for consistency, except it uses lower case for true/false # fmt: off ds_config = { "fp16": { "enabled": False }, "bf16": { "enabled": False }, "zero_optimization": { "stage": 3, "offload_param": { "device": "cpu", "pin_memory": True }, "overlap_comm": True, "contiguous_gradients": True, "reduce_bucket_size": model_hidden_size * model_hidden_size, "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size, "stage3_param_persistence_threshold": 10 * model_hidden_size }, "steps_per_print": 2000, "train_batch_size": train_batch_size, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": False } # fmt: on # next line instructs transformers to partition the model directly over multiple gpus using # deepspeed.zero.Init when model's `from_pretrained` method is called. # # **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)** # # otherwise the model will first be loaded normally and only partitioned at forward time which is # less efficient and when there is little CPU RAM may fail dschf = HfDeepSpeedConfig(ds_config) # keep this object alive # now a model can be loaded. model = AutoModelForSeq2SeqLM.from_pretrained(model_name) # initialise Deepspeed ZeRO and store only the engine object ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0] ds_engine.module.eval() # inference # Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once. # If you use more GPUs adjust for more. # And of course if you have just one input to process you then need to pass the same string to both gpus # If you use only one GPU, then you will have only rank 0. rank = torch.distributed.get_rank() if rank == 0: text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy" elif rank == 1: text_in = "Is this review positive or negative? Review: this is the worst restaurant ever" tokenizer = AutoTokenizer.from_pretrained(model_name) inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank) with torch.no_grad(): outputs = ds_engine.module.generate(inputs, synced_gpus=True) text_out = tokenizer.decode(outputs[0], skip_special_tokens=True) print(f"rank{rank}:\n in={text_in}\n out={text_out}") ``` Save the script as t0.py and launch it: ```bash $ deepspeed --num_gpus 2 t0.py rank0: in=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy out=Positive rank1: in=Is this review positive or negative? Review: this is the worst restaurant ever out=negative ``` This is a very basic example and you'll want to adapt it to your use case. ### Generate Using multiple GPUs with ZeRO-3 for generation requires synchronizing the GPUs by setting `synced_gpus=True` in the `generate()` method. Otherwise, if one GPU is finished generating before another one, the whole system hangs because the remaining GPUs haven't received the weight shard from the GPU that finished first. For Transformers>=4.28, if `synced_gpus` is automatically set to `True` if multiple GPUs are detected during generation. ## Troubleshoot When you encounter an issue, you should consider whether DeepSpeed is the cause of the problem because often it isn't (unless it's super obviously and you can see DeepSpeed modules in the exception)! The first step should be to retry your setup without DeepSpeed, and if the problem persists, then you can report the issue. If the issue is a core DeepSpeed problem and unrelated to the Transformers integration, open an Issue on the [DeepSpeed repository](https://github.com/microsoft/DeepSpeed). For issues related to the Transformers integration, please provide the following information: * the full DeepSpeed config file * the command line arguments of the `Trainer`, or `TrainingArguments` arguments if you're scripting the `Trainer` setup yourself (don't dump the `TrainingArguments` which has dozens of irrelevant entries) * the outputs of: ```bash python -c 'import torch; print(f"torch: {torch.__version__}")' python -c 'import transformers; print(f"transformers: {transformers.__version__}")' python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")' ``` * a link to a Google Colab notebook to reproduce the issue * if impossible, a standard and non-custom dataset we can use and also try to use an existing example to reproduce the issue with The following sections provide a guide for resolving two of the most common issues. ### DeepSpeed process killed at startup When the DeepSpeed process is killed during launch without a traceback, that usually means the program tried to allocate more CPU memory than your system has or your process tried to allocate more CPU memory than allowed leading the OS kernel to terminate the process. In this case, check whether your configuration file has either `offload_optimizer`, `offload_param` or both configured to offload to the CPU. If you have NVMe and ZeRO-3 setup, experiment with offloading to the NVMe ([estimate](https://deepspeed.readthedocs.io/en/latest/memory.html) the memory requirements for your model). ### NaN loss NaN loss often occurs when a model is pretrained in bf16 and then you try to use it with fp16 (especially relevant for TPU trained models). To resolve this, use fp32 or bf16 if your hardware supports it (TPU, Ampere GPUs or newer). The other issue may be related to using fp16. For example, if this is your fp16 configuration: ```yaml { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 } } ``` You might see the following `OVERFLOW!` messages in the logs: ```bash 0%| | 0/189 [00:00 Some PyTorch operations are not implemented in MPS yet and will throw an error. To avoid this, you should set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU kernels instead (you'll still see a `UserWarning`).
If you run into any other errors, please open an issue in the [PyTorch](https://github.com/pytorch/pytorch/issues) repository because the `Trainer` only integrates the MPS backend. With the `mps` device set, you can: * train larger networks or batch sizes locally * reduce data retrieval latency because the GPU's unified memory architecture allows direct access to the full memory store * reduce costs because you don't need to train on cloud-based GPUs or add additional local GPUs Get started by making sure you have PyTorch installed. MPS acceleration is supported on macOS 12.3+. ```bash pip install torch torchvision torchaudio ``` `TrainingArguments` uses the `mps` device by default if it's available which means you don't need to explicitly set the device. For example, you can run the [run_glue.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py) script with the MPS backend automatically enabled without making any changes. ```diff export TASK_NAME=mrpc python examples/pytorch/text-classification/run_glue.py \ --model_name_or_path google-bert/bert-base-cased \ --task_name $TASK_NAME \ - --use_mps_device \ --do_train \ --do_eval \ --max_seq_length 128 \ --per_device_train_batch_size 32 \ --learning_rate 2e-5 \ --num_train_epochs 3 \ --output_dir /tmp/$TASK_NAME/ \ --overwrite_output_dir ``` Backends for [distributed setups](https://pytorch.org/docs/stable/distributed.html#backends) like `gloo` and `nccl` are not supported by the `mps` device which means you can only train on a single GPU with the MPS backend. You can learn more about the MPS backend in the [Introducing Accelerated PyTorch Training on Mac](https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/) blog post. # Share a model The last two tutorials showed how you can fine-tune a model with PyTorch, Keras, and ๐Ÿค— Accelerate for distributed setups. The next step is to share your model with the community! At Hugging Face, we believe in openly sharing knowledge and resources to democratize artificial intelligence for everyone. We encourage you to consider sharing your model with the community to help others save time and resources. In this tutorial, you will learn two methods for sharing a trained or fine-tuned model on the [Model Hub](https://huggingface.co/models): - Programmatically push your files to the Hub. - Drag-and-drop your files to the Hub with the web interface. To share a model with the community, you need an account on [huggingface.co](https://huggingface.co/join). You can also join an existing organization or create a new one. ## Repository features Each repository on the Model Hub behaves like a typical GitHub repository. Our repositories offer versioning, commit history, and the ability to visualize differences. The Model Hub's built-in versioning is based on git and [git-lfs](https://git-lfs.github.com/). In other words, you can treat one model as one repository, enabling greater access control and scalability. Version control allows *revisions*, a method for pinning a specific version of a model with a commit hash, tag or branch. As a result, you can load a specific model version with the `revision` parameter: ```py >>> model = AutoModel.from_pretrained( ... "julien-c/EsperBERTo-small", revision="4c77982" # tag name, or branch name, or commit hash ... ) ``` Files are also easily edited in a repository, and you can view the commit history as well as the differences: ![vis_diff](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vis_diff.png) ## Setup Before sharing a model to the Hub, you will need your Hugging Face credentials. If you have access to a terminal, run the following command in the virtual environment where ๐Ÿค— Transformers is installed. This will store your access token in your Hugging Face cache folder (`~/.cache/` by default): ```bash huggingface-cli login ``` If you are using a notebook like Jupyter or Colaboratory, make sure you have the [`huggingface_hub`](https://huggingface.co/docs/hub/adding-a-library) library installed. This library allows you to programmatically interact with the Hub. ```bash pip install huggingface_hub ``` Then use `notebook_login` to sign-in to the Hub, and follow the link [here](https://huggingface.co/settings/token) to generate a token to login with: ```py >>> from huggingface_hub import notebook_login >>> notebook_login() ``` ## Convert a model for all frameworks To ensure your model can be used by someone working with a different framework, we recommend you convert and upload your model with both PyTorch and TensorFlow checkpoints. While users are still able to load your model from a different framework if you skip this step, it will be slower because ๐Ÿค— Transformers will need to convert the checkpoint on-the-fly. Converting a checkpoint for another framework is easy. Make sure you have PyTorch and TensorFlow installed (see [here](installation) for installation instructions), and then find the specific model for your task in the other framework. Specify `from_tf=True` to convert a checkpoint from TensorFlow to PyTorch: ```py >>> pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True) >>> pt_model.save_pretrained("path/to/awesome-name-you-picked") ``` ## Push a model during training Sharing a model to the Hub is as simple as adding an extra parameter or callback. Remember from the [fine-tuning tutorial](training), the `TrainingArguments` class is where you specify hyperparameters and additional training options. One of these training options includes the ability to push a model directly to the Hub. Set `push_to_hub=True` in your `TrainingArguments`: ```py >>> training_args = TrainingArguments(output_dir="my-awesome-model", push_to_hub=True) ``` Pass your training arguments as usual to `Trainer`: ```py >>> trainer = Trainer( ... model=model, ... args=training_args, ... train_dataset=small_train_dataset, ... eval_dataset=small_eval_dataset, ... compute_metrics=compute_metrics, ... ) ``` After you fine-tune your model, call `push_to_hub()` on `Trainer` to push the trained model to the Hub. ๐Ÿค— Transformers will even automatically add training hyperparameters, training results and framework versions to your model card! ```py >>> trainer.push_to_hub() ``` ## Use the `push_to_hub` function You can also call `push_to_hub` directly on your model to upload it to the Hub. Specify your model name in `push_to_hub`: ```py >>> pt_model.push_to_hub("my-awesome-model") ``` This creates a repository under your username with the model name `my-awesome-model`. Users can now load your model with the `from_pretrained` function: ```py >>> from transformers import AutoModel >>> model = AutoModel.from_pretrained("your_username/my-awesome-model") ``` If you belong to an organization and want to push your model under the organization name instead, just add it to the `repo_id`: ```py >>> pt_model.push_to_hub("my-awesome-org/my-awesome-model") ``` The `push_to_hub` function can also be used to add other files to a model repository. For example, add a tokenizer to a model repository: ```py >>> tokenizer.push_to_hub("my-awesome-model") ``` Or perhaps you'd like to add the TensorFlow version of your fine-tuned PyTorch model: ```py >>> tf_model.push_to_hub("my-awesome-model") ``` Now when you navigate to your Hugging Face profile, you should see your newly created model repository. Clicking on the **Files** tab will display all the files you've uploaded to the repository. For more details on how to create and upload files to a repository, refer to the Hub documentation [here](https://huggingface.co/docs/hub/how-to-upstream). ## Upload with the web interface Users who prefer a no-code approach are able to upload a model through the Hub's web interface. Visit [huggingface.co/new](https://huggingface.co/new) to create a new repository: ![new_model_repo](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/new_model_repo.png) From here, add some information about your model: - Select the **owner** of the repository. This can be yourself or any of the organizations you belong to. - Pick a name for your model, which will also be the repository name. - Choose whether your model is public or private. - Specify the license usage for your model. Now click on the **Files** tab and click on the **Add file** button to upload a new file to your repository. Then drag-and-drop a file to upload and add a commit message. ![upload_file](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/upload_file.png) ## Add a model card To make sure users understand your model's capabilities, limitations, potential biases and ethical considerations, please add a model card to your repository. The model card is defined in the `README.md` file. You can add a model card by: * Manually creating and uploading a `README.md` file. * Clicking on the **Edit model card** button in your model repository. Take a look at the DistilBert [model card](https://huggingface.co/distilbert/distilbert-base-uncased) for a good example of the type of information a model card should include. For more details about other options you can control in the `README.md` file such as a model's carbon footprint or widget examples, refer to the documentation [here](https://huggingface.co/docs/hub/models-cards). # Installation Install ๐Ÿค— Transformers for whichever deep learning library you're working with, setup your cache, and optionally configure ๐Ÿค— Transformers to run offline. ๐Ÿค— Transformers is tested on Python 3.6+, PyTorch 1.1.0+, TensorFlow 2.0+, and Flax. Follow the installation instructions below for the deep learning library you are using: * [PyTorch](https://pytorch.org/get-started/locally/) installation instructions. * [TensorFlow 2.0](https://www.tensorflow.org/install/pip) installation instructions. * [Flax](https://flax.readthedocs.io/en/latest/) installation instructions. ## Install with pip You should install ๐Ÿค— Transformers in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're unfamiliar with Python virtual environments, take a look at this [guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). A virtual environment makes it easier to manage different projects, and avoid compatibility issues between dependencies. Start by creating a virtual environment in your project directory: ```bash python -m venv .env ``` Activate the virtual environment. On Linux and MacOs: ```bash source .env/bin/activate ``` Activate Virtual environment on Windows ```bash .env/Scripts/activate ``` Now you're ready to install ๐Ÿค— Transformers with the following command: ```bash pip install transformers ``` For CPU-support only, you can conveniently install ๐Ÿค— Transformers and a deep learning library in one line. For example, install ๐Ÿค— Transformers and PyTorch with: ```bash pip install 'transformers[torch]' ``` ๐Ÿค— Transformers and TensorFlow 2.0: ```bash pip install 'transformers[tf-cpu]' ``` M1 / ARM Users You will need to install the following before installing TensorFlow 2.0 ```bash brew install cmake brew install pkg-config ``` ๐Ÿค— Transformers and Flax: ```bash pip install 'transformers[flax]' ``` Finally, check if ๐Ÿค— Transformers has been properly installed by running the following command. It will download a pretrained model: ```bash python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))" ``` Then print out the label and score: ```bash [{'label': 'POSITIVE', 'score': 0.9998704791069031}] ``` ## Install from source Install ๐Ÿค— Transformers from source with the following command: ```bash pip install git+https://github.com/huggingface/transformers ``` This command installs the bleeding edge `main` version rather than the latest `stable` version. The `main` version is useful for staying up-to-date with the latest developments. For instance, if a bug has been fixed since the last official release but a new release hasn't been rolled out yet. However, this means the `main` version may not always be stable. We strive to keep the `main` version operational, and most issues are usually resolved within a few hours or a day. If you run into a problem, please open an [Issue](https://github.com/huggingface/transformers/issues) so we can fix it even sooner! Check if ๐Ÿค— Transformers has been properly installed by running the following command: ```bash python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I love you'))" ``` ## Editable install You will need an editable install if you'd like to: * Use the `main` version of the source code. * Contribute to ๐Ÿค— Transformers and need to test changes in the code. Clone the repository and install ๐Ÿค— Transformers with the following commands: ```bash git clone https://github.com/huggingface/transformers.git cd transformers pip install -e . ``` These commands will link the folder you cloned the repository to and your Python library paths. Python will now look inside the folder you cloned to in addition to the normal library paths. For example, if your Python packages are typically installed in `~/anaconda3/envs/main/lib/python3.7/site-packages/`, Python will also search the folder you cloned to: `~/transformers/`. You must keep the `transformers` folder if you want to keep using the library. Now you can easily update your clone to the latest version of ๐Ÿค— Transformers with the following command: ```bash cd ~/transformers/ git pull ``` Your Python environment will find the `main` version of ๐Ÿค— Transformers on the next run. ## Install with conda Install from the conda channel `conda-forge`: ```bash conda install conda-forge::transformers ``` ## Cache setup Pretrained models are downloaded and locally cached at: `~/.cache/huggingface/hub`. This is the default directory given by the shell environment variable `TRANSFORMERS_CACHE`. On Windows, the default directory is given by `C:\Users\username\.cache\huggingface\hub`. You can change the shell environment variables shown below - in order of priority - to specify a different cache directory: 1. Shell environment variable (default): `HUGGINGFACE_HUB_CACHE` or `TRANSFORMERS_CACHE`. 2. Shell environment variable: `HF_HOME`. 3. Shell environment variable: `XDG_CACHE_HOME` + `/huggingface`. ๐Ÿค— Transformers will use the shell environment variables `PYTORCH_TRANSFORMERS_CACHE` or `PYTORCH_PRETRAINED_BERT_CACHE` if you are coming from an earlier iteration of this library and have set those environment variables, unless you specify the shell environment variable `TRANSFORMERS_CACHE`. ## Offline mode Run ๐Ÿค— Transformers in a firewalled or offline environment with locally cached files by setting the environment variable `HF_HUB_OFFLINE=1`. Add [๐Ÿค— Datasets](https://huggingface.co/docs/datasets/) to your offline training workflow with the environment variable `HF_DATASETS_OFFLINE=1`. ```bash HF_DATASETS_OFFLINE=1 HF_HUB_OFFLINE=1 \ python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ... ``` This script should run without hanging or waiting to timeout because it won't attempt to download the model from the Hub. You can also bypass loading a model from the Hub from each `from_pretrained()` call with the `local_files_only` parameter. When set to `True`, only local files are loaded: ```py from transformers import T5Model model = T5Model.from_pretrained("./path/to/local/directory", local_files_only=True) ``` ### Fetch models and tokenizers to use offline Another option for using ๐Ÿค— Transformers offline is to download the files ahead of time, and then point to their local path when you need to use them offline. There are three ways to do this: * Download a file through the user interface on the [Model Hub](https://huggingface.co/models) by clicking on the โ†“ icon. ![download-icon](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/download-icon.png) * Use the `PreTrainedModel.from_pretrained()` and `PreTrainedModel.save_pretrained()` workflow: 1. Download your files ahead of time with `PreTrainedModel.from_pretrained()`: ```py >>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM >>> tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B") >>> model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B") ``` 2. Save your files to a specified directory with `PreTrainedModel.save_pretrained()`: ```py >>> tokenizer.save_pretrained("./your/path/bigscience_t0") >>> model.save_pretrained("./your/path/bigscience_t0") ``` 3. Now when you're offline, reload your files with `PreTrainedModel.from_pretrained()` from the specified directory: ```py >>> tokenizer = AutoTokenizer.from_pretrained("./your/path/bigscience_t0") >>> model = AutoModel.from_pretrained("./your/path/bigscience_t0") ``` * Programmatically download files with the [huggingface_hub](https://github.com/huggingface/huggingface_hub/tree/main/src/huggingface_hub) library: 1. Install the `huggingface_hub` library in your virtual environment: ```bash python -m pip install huggingface_hub ``` 2. Use the [`hf_hub_download`](https://huggingface.co/docs/hub/adding-a-library#download-files-from-the-hub) function to download a file to a specific path. For example, the following command downloads the `config.json` file from the [T0](https://huggingface.co/bigscience/T0_3B) model to your desired path: ```py >>> from huggingface_hub import hf_hub_download >>> hf_hub_download(repo_id="bigscience/T0_3B", filename="config.json", cache_dir="./your/path/bigscience_t0") ``` Once your file is downloaded and locally cached, specify it's local path to load and use it: ```py >>> from transformers import AutoConfig >>> config = AutoConfig.from_pretrained("./your/path/bigscience_t0/config.json") ``` See the [How to download files from the Hub](https://huggingface.co/docs/hub/how-to-downstream) section for more details on downloading files stored on the Hub. # Generation with LLMs LLMs, or Large Language Models, are the key component behind text generation. In a nutshell, they consist of large pretrained transformer models trained to predict the next word (or, more precisely, token) given some input text. Since they predict one token at a time, you need to do something more elaborate to generate new sentences other than just calling the model -- you need to do autoregressive generation. Autoregressive generation is the inference-time procedure of iteratively calling a model with its own generated outputs, given a few initial inputs. In ๐Ÿค— Transformers, this is handled by the `generate()` method, which is available to all models with generative capabilities. This tutorial will show you how to: * Generate text with an LLM * Avoid common pitfalls * Next steps to help you get the most out of your LLM Before you begin, make sure you have all the necessary libraries installed: ```bash pip install transformers bitsandbytes>=0.39.0 -q ``` ## Generate text A language model trained for [causal language modeling](tasks/language_modeling) takes a sequence of text tokens as input and returns the probability distribution for the next token.
"Forward pass of an LLM"
A critical aspect of autoregressive generation with LLMs is how to select the next token from this probability distribution. Anything goes in this step as long as you end up with a token for the next iteration. This means it can be as simple as selecting the most likely token from the probability distribution or as complex as applying a dozen transformations before sampling from the resulting distribution.
"Autoregressive generation iteratively selects the next token from a probability distribution to generate text"
The process depicted above is repeated iteratively until some stopping condition is reached. Ideally, the stopping condition is dictated by the model, which should learn when to output an end-of-sequence (`EOS`) token. If this is not the case, generation stops when some predefined maximum length is reached. Properly setting up the token selection step and the stopping condition is essential to make your model behave as you'd expect on your task. That is why we have a `GenerationConfig` file associated with each model, which contains a good default generative parameterization and is loaded alongside your model. Let's talk code! If you're interested in basic LLM usage, our high-level [`Pipeline`](pipeline_tutorial) interface is a great starting point. However, LLMs often require advanced features like quantization and fine control of the token selection step, which is best done through `generate()`. Autoregressive generation with LLMs is also resource-intensive and should be executed on a GPU for adequate throughput. First, you need to load the model. ```py >>> from transformers import AutoModelForCausalLM >>> model = AutoModelForCausalLM.from_pretrained( ... "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True ... ) ``` You'll notice two flags in the `from_pretrained` call: - `device_map` ensures the model is moved to your GPU(s) - `load_in_4bit` applies [4-bit dynamic quantization](main_classes/quantization) to massively reduce the resource requirements There are other ways to initialize a model, but this is a good baseline to begin with an LLM. Next, you need to preprocess your text input with a [tokenizer](tokenizer_summary). ```py >>> from transformers import AutoTokenizer >>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left") >>> model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda") ``` The `model_inputs` variable holds the tokenized text input, as well as the attention mask. While `generate()` does its best effort to infer the attention mask when it is not passed, we recommend passing it whenever possible for optimal results. After tokenizing the inputs, you can call the `generate()` method to returns the generated tokens. The generated tokens then should be converted to text before printing. ```py >>> generated_ids = model.generate(**model_inputs) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 'A list of colors: red, blue, green, yellow, orange, purple, pink,' ``` Finally, you don't need to do it one sequence at a time! You can batch your inputs, which will greatly improve the throughput at a small latency and memory cost. All you need to do is to make sure you pad your inputs properly (more on that below). ```py >>> tokenizer.pad_token = tokenizer.eos_token # Most LLMs don't have a pad token by default >>> model_inputs = tokenizer( ... ["A list of colors: red, blue", "Portugal is"], return_tensors="pt", padding=True ... ).to("cuda") >>> generated_ids = model.generate(**model_inputs) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True) ['A list of colors: red, blue, green, yellow, orange, purple, pink,', 'Portugal is a country in southwestern Europe, on the Iber'] ``` And that's it! In a few lines of code, you can harness the power of an LLM. ## Common pitfalls There are many [generation strategies](generation_strategies), and sometimes the default values may not be appropriate for your use case. If your outputs aren't aligned with what you're expecting, we've created a list of the most common pitfalls and how to avoid them. ```py >>> from transformers import AutoModelForCausalLM, AutoTokenizer >>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1") >>> tokenizer.pad_token = tokenizer.eos_token # Most LLMs don't have a pad token by default >>> model = AutoModelForCausalLM.from_pretrained( ... "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True ... ) ``` ### Generated output is too short/long If not specified in the `GenerationConfig` file, `generate` returns up to 20 tokens by default. We highly recommend manually setting `max_new_tokens` in your `generate` call to control the maximum number of new tokens it can return. Keep in mind LLMs (more precisely, [decoder-only models](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt)) also return the input prompt as part of the output. ```py >>> model_inputs = tokenizer(["A sequence of numbers: 1, 2"], return_tensors="pt").to("cuda") >>> # By default, the output will contain up to 20 tokens >>> generated_ids = model.generate(**model_inputs) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 'A sequence of numbers: 1, 2, 3, 4, 5' >>> # Setting `max_new_tokens` allows you to control the maximum length >>> generated_ids = model.generate(**model_inputs, max_new_tokens=50) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 'A sequence of numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,' ``` ### Incorrect generation mode By default, and unless specified in the `GenerationConfig` file, `generate` selects the most likely token at each iteration (greedy decoding). Depending on your task, this may be undesirable; creative tasks like chatbots or writing an essay benefit from sampling. On the other hand, input-grounded tasks like audio transcription or translation benefit from greedy decoding. Enable sampling with `do_sample=True`, and you can learn more about this topic in this [blog post](https://huggingface.co/blog/how-to-generate). ```py >>> # Set seed for reproducibility -- you don't need this unless you want full reproducibility >>> from transformers import set_seed >>> set_seed(42) >>> model_inputs = tokenizer(["I am a cat."], return_tensors="pt").to("cuda") >>> # LLM + greedy decoding = repetitive, boring output >>> generated_ids = model.generate(**model_inputs) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 'I am a cat. I am a cat. I am a cat. I am a cat' >>> # With sampling, the output becomes more creative! >>> generated_ids = model.generate(**model_inputs, do_sample=True) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 'I am a cat. Specifically, I am an indoor-only cat. I' ``` ### Wrong padding side LLMs are [decoder-only](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt) architectures, meaning they continue to iterate on your input prompt. If your inputs do not have the same length, they need to be padded. Since LLMs are not trained to continue from pad tokens, your input needs to be left-padded. Make sure you also don't forget to pass the attention mask to generate! ```py >>> # The tokenizer initialized above has right-padding active by default: the 1st sequence, >>> # which is shorter, has padding on the right side. Generation fails to capture the logic. >>> model_inputs = tokenizer( ... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt" ... ).to("cuda") >>> generated_ids = model.generate(**model_inputs) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] '1, 2, 33333333333' >>> # With left-padding, it works as expected! >>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left") >>> tokenizer.pad_token = tokenizer.eos_token # Most LLMs don't have a pad token by default >>> model_inputs = tokenizer( ... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt" ... ).to("cuda") >>> generated_ids = model.generate(**model_inputs) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] '1, 2, 3, 4, 5, 6,' ``` ### Wrong prompt Some models and tasks expect a certain input prompt format to work properly. When this format is not applied, you will get a silent performance degradation: the model kinda works, but not as well as if you were following the expected prompt. More information about prompting, including which models and tasks need to be careful, is available in this [guide](tasks/prompting). Let's see an example with a chat LLM, which makes use of [chat templating](chat_templating): ```python >>> tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha") >>> model = AutoModelForCausalLM.from_pretrained( ... "HuggingFaceH4/zephyr-7b-alpha", device_map="auto", load_in_4bit=True ... ) >>> set_seed(0) >>> prompt = """How many helicopters can a human eat in one sitting? Reply as a thug.""" >>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda") >>> input_length = model_inputs.input_ids.shape[1] >>> generated_ids = model.generate(**model_inputs, max_new_tokens=20) >>> print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0]) "I'm not a thug, but i can tell you that a human cannot eat" >>> # Oh no, it did not follow our instruction to reply as a thug! Let's see what happens when we write >>> # a better prompt and use the right template for this model (through `tokenizer.apply_chat_template`) >>> set_seed(0) >>> messages = [ ... { ... "role": "system", ... "content": "You are a friendly chatbot who always responds in the style of a thug", ... }, ... {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}, ... ] >>> model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda") >>> input_length = model_inputs.shape[1] >>> generated_ids = model.generate(model_inputs, do_sample=True, max_new_tokens=20) >>> print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0]) 'None, you thug. How bout you try to focus on more useful questions?' >>> # As we can see, it followed a proper thug style ๐Ÿ˜Ž ``` ## Further resources While the autoregressive generation process is relatively straightforward, making the most out of your LLM can be a challenging endeavor because there are many moving parts. For your next steps to help you dive deeper into LLM usage and understanding: ### Advanced generate usage 1. Guide on how to [control different generation methods](generation_strategies), how to set up the generation configuration file, and how to stream the output; 2. [Accelerating text generation](llm_optims); 3. [Prompt templates for chat LLMs](chat_templating); 4. [Prompt design guide](tasks/prompting); 5. API reference on `GenerationConfig`, `generate()`, and [generate-related classes](internal/generation_utils). Most of the classes, including the logits processors, have usage examples! ### LLM leaderboards 1. [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), which focuses on the quality of the open-source models; 2. [Open LLM-Perf Leaderboard](https://huggingface.co/spaces/optimum/llm-perf-leaderboard), which focuses on LLM throughput. ### Latency, throughput and memory utilization 1. Guide on how to [optimize LLMs for speed and memory](llm_tutorial_optimization); 2. Guide on [quantization](main_classes/quantization) such as bitsandbytes and autogptq, which shows you how to drastically reduce your memory requirements. ### Related libraries 1. [`optimum`](https://github.com/huggingface/optimum), an extension of ๐Ÿค— Transformers that optimizes for specific hardware devices. 2. [`outlines`](https://github.com/outlines-dev/outlines), a library where you can constrain text generation (e.g. to generate JSON files); 3. [`SynCode`](https://github.com/uiuc-focal-lab/syncode), a library for context-free grammar guided generation. (e.g. JSON, SQL, Python) 4. [`text-generation-inference`](https://github.com/huggingface/text-generation-inference), a production-ready server for LLMs; 5. [`text-generation-webui`](https://github.com/oobabooga/text-generation-webui), a UI for text generation; # Use tokenizers from ๐Ÿค— Tokenizers The `PreTrainedTokenizerFast` depends on the [๐Ÿค— Tokenizers](https://huggingface.co/docs/tokenizers) library. The tokenizers obtained from the ๐Ÿค— Tokenizers library can be loaded very simply into ๐Ÿค— Transformers. Before getting in the specifics, let's first start by creating a dummy tokenizer in a few lines: ```python >>> from tokenizers import Tokenizer >>> from tokenizers.models import BPE >>> from tokenizers.trainers import BpeTrainer >>> from tokenizers.pre_tokenizers import Whitespace >>> tokenizer = Tokenizer(BPE(unk_token="[UNK]")) >>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]) >>> tokenizer.pre_tokenizer = Whitespace() >>> files = [...] >>> tokenizer.train(files, trainer) ``` We now have a tokenizer trained on the files we defined. We can either continue using it in that runtime, or save it to a JSON file for future re-use. ## Loading directly from the tokenizer object Let's see how to leverage this tokenizer object in the ๐Ÿค— Transformers library. The `PreTrainedTokenizerFast` class allows for easy instantiation, by accepting the instantiated *tokenizer* object as an argument: ```python >>> from transformers import PreTrainedTokenizerFast >>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer) ``` This object can now be used with all the methods shared by the ๐Ÿค— Transformers tokenizers! Head to [the tokenizer page](main_classes/tokenizer) for more information. ## Loading from a JSON file In order to load a tokenizer from a JSON file, let's first start by saving our tokenizer: ```python >>> tokenizer.save("tokenizer.json") ``` The path to which we saved this file can be passed to the `PreTrainedTokenizerFast` initialization method using the `tokenizer_file` parameter: ```python >>> from transformers import PreTrainedTokenizerFast >>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json") ``` This object can now be used with all the methods shared by the ๐Ÿค— Transformers tokenizers! Head to [the tokenizer page](main_classes/tokenizer) for more information. # Efficient Training on Multiple GPUs If training a model on a single GPU is too slow or if the model's weights do not fit in a single GPU's memory, transitioning to a multi-GPU setup may be a viable option. Prior to making this transition, thoroughly explore all the strategies covered in the [Methods and tools for efficient training on a single GPU](perf_train_gpu_one) as they are universally applicable to model training on any number of GPUs. Once you have employed those strategies and found them insufficient for your case on a single GPU, consider moving to multiple GPUs. Transitioning from a single GPU to multiple GPUs requires the introduction of some form of parallelism, as the workload must be distributed across the resources. Multiple techniques can be employed to achieve parallelism, such as data parallelism, tensor parallelism, and pipeline parallelism. It's important to note that there isn't a one-size-fits-all solution, and the optimal settings depend on the specific hardware configuration you are using. This guide offers an in-depth overview of individual types of parallelism, as well as guidance on ways to combine techniques and choosing an appropriate approach. For step-by-step tutorials on distributed training, please refer to the [๐Ÿค— Accelerate documentation](https://huggingface.co/docs/accelerate/index). While the main concepts discussed in this guide are likely applicable across frameworks, here we focus on PyTorch-based implementations. Before diving deeper into the specifics of each technique, let's go over the rough decision process when training large models on a large infrastructure. ## Scalability strategy Begin by estimating how much vRAM is required to train your model. For models hosted on the ๐Ÿค— Hub, use our [Model Memory Calculator](https://huggingface.co/spaces/hf-accelerate/model-memory-usage), which gives you accurate calculations within a few percent margin. **Parallelization strategy for a single Node / multi-GPU setup** When training a model on a single node with multiple GPUs, your choice of parallelization strategy can significantly impact performance. Here's a breakdown of your options: **Case 1: Your model fits onto a single GPU** If your model can comfortably fit onto a single GPU, you have two primary options: 1. DDP - Distributed DataParallel 2. [Zero Redundancy Optimizer (ZeRO)](https://arxiv.org/abs/1910.02054) - depending on the situation and configuration used, this method may or may not be faster, however, it's worth experimenting with it. **Case 2: Your model doesn't fit onto a single GPU:** If your model is too large for a single GPU, you have several alternatives to consider: 1. PipelineParallel (PP) 2. [ZeRO](https://arxiv.org/abs/1910.02054) 3. [TensorParallel](#tensor-parallelism) (TP) With very fast inter-node connectivity (e.g., NVLINK or NVSwitch) all three strategies (PP, ZeRO, TP) should result in similar performance. However, without these, PP will be faster than TP or ZeRO. The degree of TP may also make a difference. It's best to experiment with your specific setup to determine the most suitable strategy. TP is almost always used within a single node. That is TP size <= GPUs per node. **Case 3: Largest layer of your model does not fit onto a single GPU** 1. If you are not using ZeRO, you have to use TensorParallel (TP), because PipelineParallel (PP) alone won't be sufficient to accommodate the large layer. 2. If you are using ZeRO, additionally adopt techniques from the [Methods and tools for efficient training on a single GPU](perf_train_gpu_one). **Parallelization strategy for a multi-Node / multi-GPU setup** * When you have fast inter-node connectivity (e.g., NVLINK or NVSwitch) consider using one of these options: 1. ZeRO - as it requires close to no modifications to the model 2. A combination of PipelineParallel(PP) with TensorParallel(TP) and DataParallel(DP) - this approach will result in fewer communications, but requires significant changes to the model * When you have slow inter-node connectivity and still low on GPU memory: 1. Employ a combination of DataParallel(DP) with PipelineParallel(PP), TensorParallel(TP), and ZeRO. In the following sections of this guide we dig deeper into how these different parallelism methods work. ## Data Parallelism Even with only 2 GPUs, you can readily leverage the accelerated training capabilities offered by PyTorch's built-in features, such as `DataParallel` (DP) and `DistributedDataParallel` (DDP). Note that [PyTorch documentation](https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html) recommends to prefer `DistributedDataParallel` (DDP) over `DataParallel` (DP) for multi-GPU training as it works for all models. Let's take a look at how these two methods work and what makes them different. ### DataParallel vs DistributedDataParallel To understand the key differences in inter-GPU communication overhead between the two methods, let's review the processes per batch: [DDP](https://pytorch.org/docs/master/notes/ddp.html): - At the start time the main process replicates the model once from GPU 0 to the rest of GPUs - Then for each batch: 1. Each GPU directly consumes its mini-batch of data. 2. During `backward`, once the local gradients are ready, they are averaged across all processes. [DP](https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html): For each batch: 1. GPU 0 reads the batch of data and then sends a mini-batch to each GPU. 2. The up-to-date model is replicated from GPU 0 to each GPU. 3. `forward` is executed, and output from each GPU is sent to GPU 0 to compute the loss. 4. The loss is distributed from GPU 0 to all GPUs, and `backward` is run. 5. Gradients from each GPU are sent to GPU 0 and averaged. Key differences include: 1. DDP performs only a single communication per batch - sending gradients, while DP performs five different data exchanges per batch. DDP copies data using [torch.distributed](https://pytorch.org/docs/master/distributed.html), while DP copies data within the process via Python threads (which introduces limitations associated with GIL). As a result, **`DistributedDataParallel` (DDP) is generally faster than `DataParallel` (DP)** unless you have slow GPU card inter-connectivity. 2. Under DP, GPU 0 performs significantly more work than other GPUs, resulting in GPU under-utilization. 3. DDP supports distributed training across multiple machines, whereas DP does not. This is not an exhaustive list of differences between DP and DDP, however, other nuances are out of scope of this guide. You can get a deeper understanding of these methods by reading this [article](https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/). Let's illustrate the differences between DP and DDP with an experiment. We'll benchmark the differences between DP and DDP with an added context of NVLink presence: * Hardware: 2x TITAN RTX 24GB each + NVlink with 2 NVLinks (`NV2` in `nvidia-smi topo -m`). * Software: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0`. To disable the NVLink feature on one of the benchmarks, we use `NCCL_P2P_DISABLE=1`. Here is the benchmarking code and outputs: **DP** ```bash rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \ python examples/pytorch/language-modeling/run_clm.py \ --model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ --do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 {'train_runtime': 110.5948, 'train_samples_per_second': 1.808, 'epoch': 0.69} ``` **DDP w/ NVlink** ```bash rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \ torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ --model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ --do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 {'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69} ``` **DDP w/o NVlink** ```bash rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \ torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ --model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ --do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 {'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69} ``` Here are the same benchmarking results gathered in a table for convenience: | Type | NVlink | Time | | :----- | ----- | ---: | | 2:DP | Y | 110s | | 2:DDP | Y | 101s | | 2:DDP | N | 131s | As you can see, in this case DP is ~10% slower than DDP with NVlink, but ~15% faster than DDP without NVlink. The real difference will depend on how much data each GPU needs to sync with the others - the more there is to sync, the more a slow link will impede the overall runtime. ## ZeRO Data Parallelism ZeRO-powered data parallelism (ZeRO-DP) is illustrated in the following diagram from this [blog post](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/).
DeepSpeed-Image-1
While it may appear complex, it is a very similar concept to `DataParallel` (DP). The difference is that instead of replicating the full model parameters, gradients and optimizer states, each GPU stores only a slice of it. Then, at run-time when the full layer parameters are needed just for the given layer, all GPUs synchronize to give each other parts that they miss. To illustrate this idea, consider a simple model with 3 layers (La, Lb, and Lc), where each layer has 3 parameters. Layer La, for example, has weights a0, a1 and a2: ``` La | Lb | Lc ---|----|--- a0 | b0 | c0 a1 | b1 | c1 a2 | b2 | c2 ``` If we have 3 GPUs, ZeRO-DP splits the model onto 3 GPUs like so: ``` GPU0: La | Lb | Lc ---|----|--- a0 | b0 | c0 GPU1: La | Lb | Lc ---|----|--- a1 | b1 | c1 GPU2: La | Lb | Lc ---|----|--- a2 | b2 | c2 ``` In a way, this is the same horizontal slicing as tensor parallelism, as opposed to Vertical slicing, where one puts whole layer-groups on different GPUs. Now let's see how this works: Each of these GPUs will get the usual mini-batch as it works in DP: ``` x0 => GPU0 x1 => GPU1 x2 => GPU2 ``` The inputs are passed without modifications as if they would be processed by the original model. First, the inputs get to the layer `La`. What happens at this point? On GPU0: the x0 mini-batch requires the a0, a1, a2 parameters to do its forward path through the layer, but the GPU0 has only a0. It will get a1 from GPU1 and a2 from GPU2, bringing all the pieces of the model together. In parallel, GPU1 gets another mini-batch - x1. GPU1 has the a1 parameter, but needs a0 and a2, so it gets those from GPU0 and GPU2. Same happens to GPU2 that gets the mini-batch x2. It gets a0 and a1 from GPU0 and GPU1. This way each of the 3 GPUs gets the full tensors reconstructed and makes a forward pass with its own mini-batch. As soon as the calculation is done, the data that is no longer needed gets dropped - it's only used during the calculation. The reconstruction is done efficiently via a pre-fetch. Then the whole process is repeated for layer Lb, then Lc forward-wise, and then backward Lc -> Lb -> La. This mechanism is similar to an efficient group backpacking strategy: person A carries the tent, person B carries the stove, and person C carries the axe. Each night they all share what they have with others and get from others what they don't have, and in the morning they pack up their allocated type of gear and continue on their way. This is what ZeRO DP/Sharded DDP is. Compare this strategy to the simple one where each person has to carry their own tent, stove and axe (similar to DataParallel (DP and DDP) in PyTorch), which would be far more inefficient. While reading the literature on this topic you may encounter the following synonyms: Sharded, Partitioned. If you pay close attention the way ZeRO partitions the model's weights - it looks very similar to tensor parallelism which will be discussed later. This is because it partitions/shards each layer's weights, unlike vertical model parallelism which is discussed next. Implementations: - [DeepSpeed](https://www.deepspeed.ai/tutorials/zero/) ZeRO-DP stages 1+2+3 - [`Accelerate` integration](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed) - [`transformers` integration](main_classes/trainer#trainer-integrations) ## From Naive Model Parallelism to Pipeline Parallelism To explain Pipeline parallelism, we'll first look into Naive Model Parallelism (MP), also known as Vertical MP. This approach involves distributing groups of model layers across multiple GPUs by assigning specific layers to specific GPUs with `.to()`. As data flows through these layers, it is moved to the same GPU as the layer, while the other layers remain untouched. We refer to this Model parallelism as "Vertical" because of how models are typically visualized. For example, the following diagram shows an 8-layer model split vertically into two slices, placing layers 0-3 onto GPU0 and 4-7 to GPU1: ``` ================ | Layer | | | 0 | | | 1 | GPU0 | | 2 | | | 3 | | ================ | Layer | | | 4 | | | 5 | GPU1 | | 6 | | | 7 | | ================ ``` In this example, when data moves from layer 0 to 3, it's no different from regular forward pass. However, passing data from layer 3 to 4 requires moving it from GPU0 to GPU1, introducing a communication overhead. If the participating GPUs are on the same compute node (e.g. same physical machine) this copying is fast, but if the GPUs are distributed across different compute nodes (e.g. multiple machines), the communication overhead could be substantially greater. Following that, layers 4 to 7 work as they would in the original model. Upon completion of the 7th layer, there is often a need to send the data back to layer 0 where the labels are (or alternatively send the labels to the last layer). Now the loss can be computed and the optimizer can do its work. Naive Model Parallelism comes several shortcomings: - **All but one GPU are idle at any given moment**: if 4 GPUs are used, it's nearly identical to quadrupling the amount of memory of a single GPU, and ignoring the rest of the hardware. - **Overhead in data transfer between devices**: E.g. 4x 6GB cards will be able to accommodate the same size as 1x 24GB card using naive MP, but a single 24GB card will complete the training faster, because it doesn't have the data copying overhead. But, say, if you have 40GB cards and need to fit a 45GB model you can with 4x 40GB cards (but barely because of the gradient and optimizer states) - **Copying shared embeddings**: Shared embeddings may need to get copied back and forth between GPUs. Now that you are familiar with how the naive approach to model parallelism works and its shortcomings, let's look at Pipeline Parallelism (PP). PP is almost identical to a naive MP, but it solves the GPU idling problem by chunking the incoming batch into micro-batches and artificially creating a pipeline, which allows different GPUs to concurrently participate in the computation process. The following illustration from the [GPipe paper](https://ai.googleblog.com/2019/03/introducing-gpipe-open-source-library.html) shows the naive MP on the top, and PP on the bottom:
MP vs PP
At the bottom of the diagram, you can observe that the Pipeline Parallelism (PP) approach minimizes the number of idle GPU zones, referred to as 'bubbles'. Both parts of the diagram show a parallelism level of degree 4, meaning that 4 GPUs are involved in the pipeline. You can see that there's a forward path of 4 pipe stages (F0, F1, F2 and F3) followed by a backward path in reverse order (B3, B2, B1, and B0). PP introduces a new hyperparameter to tune - `chunks`, which determines how many data chunks are sent in a sequence through the same pipe stage. For example, in the bottom diagram you can see `chunks=4`. GPU0 performs the same forward path on chunk 0, 1, 2 and 3 (F0,0, F0,1, F0,2, F0,3) and then it waits for other GPUs to do complete their work. Only when the other GPUs begin to complete their work, GPU0 starts to work again doing the backward path for chunks 3, 2, 1 and 0 (B0,3, B0,2, B0,1, B0,0). Note that this is the same concept as gradient accumulation steps. PyTorch uses `chunks`, while DeepSpeed refers to the same hyperparameter as gradient accumulation steps. Because of the chunks, PP introduces the notion of micro-batches (MBS). DP splits the global data batch size into mini-batches, so if you have a DP degree of 4, a global batch size of 1024 gets split up into 4 mini-batches of 256 each (1024/4). And if the number of `chunks` (or GAS) is 32 we end up with a micro-batch size of 8 (256/32). Each Pipeline stage works with a single micro-batch at a time. To calculate the global batch size of the DP + PP setup, use the formula: `mbs * chunks * dp_degree` (`8 * 32 * 4 = 1024`). With `chunks=1` you end up with the naive MP, which is inefficient. With a large `chunks` value you end up with tiny micro-batch sizes which is also inefficient. For this reason, we encourage to experiment with the `chunks` value to find the one that leads to the most efficient GPUs utilization. You may notice a bubble of "dead" time on the diagram that can't be parallelized because the last `forward` stage has to wait for `backward` to complete the pipeline. The purpose of finding the best value for `chunks` is to enable a high concurrent GPU utilization across all participating GPUs which translates to minimizing the size of the bubble. Pipeline API solutions have been implemented in: - PyTorch - DeepSpeed - Megatron-LM These come with some shortcomings: - They have to modify the model quite heavily, because Pipeline requires one to rewrite the normal flow of modules into a `nn.Sequential` sequence of the same, which may require changes to the design of the model. - Currently the Pipeline API is very restricted. If you had a bunch of Python variables being passed in the very first stage of the Pipeline, you will have to find a way around it. Currently, the pipeline interface requires either a single Tensor or a tuple of Tensors as the only input and output. These tensors must have a batch size as the very first dimension, since pipeline is going to chunk the mini batch into micro-batches. Possible improvements are being discussed here https://github.com/pytorch/pytorch/pull/50693 - Conditional control flow at the level of pipe stages is not possible - e.g., Encoder-Decoder models like T5 require special workarounds to handle a conditional encoder stage. - They have to arrange each layer so that the output of one layer becomes an input to the other layer. More recent solutions include: - Varuna - Sagemaker We have not experimented with Varuna and SageMaker but their papers report that they have overcome the list of problems mentioned above and that they require smaller changes to the user's model. Implementations: - [PyTorch](https://pytorch.org/docs/stable/pipeline.html) (initial support in pytorch-1.8, and progressively getting improved in 1.9 and more so in 1.10). Some [examples](https://github.com/pytorch/pytorch/blob/master/benchmarks/distributed/pipeline/pipe.py) - [DeepSpeed](https://www.deepspeed.ai/tutorials/pipeline/) - [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) has an internal implementation - no API. - [Varuna](https://github.com/microsoft/varuna) - [SageMaker](https://arxiv.org/abs/2111.05972) - this is a proprietary solution that can only be used on AWS. - [OSLO](https://github.com/tunib-ai/oslo) - this is implemented based on the Hugging Face Transformers. ๐Ÿค— Transformers status: as of this writing none of the models supports full-PP. GPT2 and T5 models have naive MP support. The main obstacle is being unable to convert the models to `nn.Sequential` and have all the inputs to be Tensors. This is because currently the models include many features that make the conversion very complicated, and will need to be removed to accomplish that. DeepSpeed and Megatron-LM integrations are available in [๐Ÿค— Accelerate](https://huggingface.co/docs/accelerate/main/en/usage_guides/deepspeed) Other approaches: DeepSpeed, Varuna and SageMaker use the concept of an [Interleaved Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html)
Interleaved pipeline execution
Here the bubble (idle time) is further minimized by prioritizing backward passes. Varuna further attempts to improve the schedule by using simulations to discover the most efficient scheduling. OSLO has pipeline parallelism implementation based on the Transformers without `nn.Sequential` conversion. ## Tensor Parallelism In Tensor Parallelism, each GPU processes a slice of a tensor and only aggregates the full tensor for operations requiring it. To describe this method, this section of the guide relies on the concepts and diagrams from the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) paper: [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/abs/2104.04473). The main building block of any transformer is a fully connected `nn.Linear` followed by a nonlinear activation `GeLU`. The dot dot-product part of it, following the Megatron's paper notation, can be written as `Y = GeLU(XA)`, where `X` is an input vector, `Y` is the output vector, and `A` is the weight matrix. If we look at the computation in matrix form, you can see how the matrix multiplication can be split between multiple GPUs:
Parallel GEMM
If we split the weight matrix `A` column-wise across `N` GPUs and perform matrix multiplications `XA_1` through `XA_n` in parallel, then we will end up with `N` output vectors `Y_1, Y_2, ..., Y_n` which can be fed into `GeLU` independently:
Independent GeLU
Using this principle, we can update a multi-layer perceptron of arbitrary depth, without the need for any synchronization between GPUs until the very end, where we need to reconstruct the output vector from shards. The Megatron-LM paper authors provide a helpful illustration for that:
Parallel shard processing
Parallelizing the multi-headed attention layers is even simpler, since they are already inherently parallel, due to having multiple independent heads!
Parallel self-attention
Special considerations: TP requires very fast network, and therefore it's not advisable to do TP across more than one node. Practically, if a node has 4 GPUs, the highest TP degree is therefore 4. If you need a TP degree of 8, you need to use nodes that have at least 8 GPUs. This section is based on the original much more [detailed TP overview](https://github.com/huggingface/transformers/issues/10321#issuecomment-783543530). by [@anton-l](https://github.com/anton-l). Alternative names: - DeepSpeed calls it [tensor slicing](https://www.deepspeed.ai/training/#model-parallelism) Implementations: - [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) has an internal implementation, as it's very model-specific - [parallelformers](https://github.com/tunib-ai/parallelformers) (only inference at the moment) - [SageMaker](https://arxiv.org/abs/2111.05972) - this is a proprietary solution that can only be used on AWS. - [OSLO](https://github.com/tunib-ai/oslo) has the tensor parallelism implementation based on the Transformers. SageMaker combines TP with DP for a more efficient processing. ๐Ÿค— Transformers status: - core: not yet implemented in the core - but if you want inference [parallelformers](https://github.com/tunib-ai/parallelformers) provides this support for most of our models. So until this is implemented in the core you can use theirs. And hopefully training mode will be supported too. - Deepspeed-Inference also supports our BERT, GPT-2, and GPT-Neo models in their super-fast CUDA-kernel-based inference mode, see more [here](https://www.deepspeed.ai/tutorials/inference-tutorial/) ๐Ÿค— Accelerate integrates with [TP from Megatron-LM](https://huggingface.co/docs/accelerate/v0.23.0/en/usage_guides/megatron_lm). ## Data Parallelism + Pipeline Parallelism The following diagram from the DeepSpeed [pipeline tutorial](https://www.deepspeed.ai/tutorials/pipeline/) demonstrates how one can combine DP with PP.
DP + PP-2d
Here it's important to see how DP rank 0 doesn't see GPU2 and DP rank 1 doesn't see GPU3. To DP there is just GPUs 0 and 1 where it feeds data as if there were just 2 GPUs. GPU0 "secretly" offloads some of its load to GPU2 using PP. And GPU1 does the same by enlisting GPU3 to its aid. Since each dimension requires at least 2 GPUs, here you'd need at least 4 GPUs. Implementations: - [DeepSpeed](https://github.com/microsoft/DeepSpeed) - [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) - [Varuna](https://github.com/microsoft/varuna) - [SageMaker](https://arxiv.org/abs/2111.05972) - [OSLO](https://github.com/tunib-ai/oslo) ๐Ÿค— Transformers status: not yet implemented ## Data Parallelism + Pipeline Parallelism + Tensor Parallelism To get an even more efficient training a 3D parallelism is used where PP is combined with TP and DP. This can be seen in the following diagram.
dp-pp-tp-3d
This diagram is from a blog post [3D parallelism: Scaling to trillion-parameter models](https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/), which is a good read as well. Since each dimension requires at least 2 GPUs, here you'd need at least 8 GPUs. Implementations: - [DeepSpeed](https://github.com/microsoft/DeepSpeed) - DeepSpeed also includes an even more efficient DP, which they call ZeRO-DP. - [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) - [Varuna](https://github.com/microsoft/varuna) - [SageMaker](https://arxiv.org/abs/2111.05972) - [OSLO](https://github.com/tunib-ai/oslo) ๐Ÿค— Transformers status: not yet implemented, since we have no PP and TP. ## ZeRO Data Parallelism + Pipeline Parallelism + Tensor Parallelism One of the main features of DeepSpeed is ZeRO, which is a super-scalable extension of DP. It has already been discussed in [ZeRO Data Parallelism](#zero-data-parallelism). Normally it's a standalone feature that doesn't require PP or TP. But it can be combined with PP and TP. When ZeRO-DP is combined with PP (and optionally TP) it typically enables only ZeRO stage 1 (optimizer sharding). While it's theoretically possible to use ZeRO stage 2 (gradient sharding) with Pipeline Parallelism, it will have negative performance impacts. There would need to be an additional reduce-scatter collective for every micro-batch to aggregate the gradients before sharding, which adds a potentially significant communication overhead. By nature of Pipeline Parallelism, small micro-batches are used and instead the focus is on trying to balance arithmetic intensity (micro-batch size) with minimizing the Pipeline bubble (number of micro-batches). Therefore those communication costs are going to impact the performance. In addition, there are already fewer layers than normal due to PP and so the memory savings won't be huge. PP already reduces gradient size by ``1/PP``, and so gradient sharding savings on top of that are less significant than pure DP. ZeRO stage 3 is not a good choice either for the same reason - more inter-node communications required. And since we have ZeRO, the other benefit is ZeRO-Offload. Since this is stage 1 optimizer states can be offloaded to CPU. Implementations: - [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed) and [Megatron-Deepspeed from BigScience](https://github.com/bigscience-workshop/Megatron-DeepSpeed), which is the fork of the former repo. - [OSLO](https://github.com/tunib-ai/oslo) Important papers: - [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model]( https://arxiv.org/abs/2201.11990) ๐Ÿค— Transformers status: not yet implemented, since we have no PP and TP. ## FlexFlow [FlexFlow](https://github.com/flexflow/FlexFlow) also solves the parallelization problem in a slightly different approach. Paper: ["Beyond Data and Model Parallelism for Deep Neural Networks" by Zhihao Jia, Matei Zaharia, Alex Aiken](https://arxiv.org/abs/1807.05358) It performs a sort of 4D Parallelism over Sample-Operator-Attribute-Parameter. 1. Sample = Data Parallelism (sample-wise parallel) 2. Operator = Parallelize a single operation into several sub-operations 3. Attribute = Data Parallelism (length-wise parallel) 4. Parameter = Model Parallelism (regardless of dimension - horizontal or vertical) Examples: * Sample Let's take 10 batches of sequence length 512. If we parallelize them by sample dimension into 2 devices, we get 10 x 512 which becomes be 5 x 2 x 512. * Operator If we perform layer normalization, we compute std first and mean second, and then we can normalize data. Operator parallelism allows computing std and mean in parallel. So if we parallelize them by operator dimension into 2 devices (cuda:0, cuda:1), first we copy input data into both devices, and cuda:0 computes std, cuda:1 computes mean at the same time. * Attribute We have 10 batches of 512 length. If we parallelize them by attribute dimension into 2 devices, 10 x 512 will be 10 x 2 x 256. * Parameter It is similar with tensor model parallelism or naive layer-wise model parallelism.
flex-flow-soap
The significance of this framework is that it takes resources like (1) GPU/TPU/CPU vs. (2) RAM/DRAM vs. (3) fast-intra-connect/slow-inter-connect and it automatically optimizes all these algorithmically deciding which parallelisation to use where. One very important aspect is that FlexFlow is designed for optimizing DNN parallelizations for models with static and fixed workloads, since models with dynamic behavior may prefer different parallelization strategies across iterations. So the promise is very attractive - it runs a 30min simulation on the cluster of choice and it comes up with the best strategy to utilise this specific environment. If you add/remove/replace any parts it'll run and re-optimize the plan for that. And then you can train. A different setup will have its own custom optimization. ๐Ÿค— Transformers status: Transformers models are FX-trace-able via [transformers.utils.fx](https://github.com/huggingface/transformers/blob/master/src/transformers/utils/fx.py), which is a prerequisite for FlexFlow, however, changes are required on the FlexFlow side to make it work with Transformers models. ## GPU selection When training on multiple GPUs, you can specify the number of GPUs to use and in what order. This can be useful for instance when you have GPUs with different computing power and want to use the faster GPU first. The selection process works for both [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) and [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html) to use only a subset of the available GPUs, and you don't need Accelerate or the [DeepSpeed integration](./main_classes/deepspeed). ### Number of GPUs For example, if you have 4 GPUs and you only want to use the first 2: Use the `--nproc_per_node` to select how many GPUs to use. ```bash torchrun --nproc_per_node=2 trainer-program.py ... ``` Use `--num_processes` to select how many GPUs to use. ```bash accelerate launch --num_processes 2 trainer-program.py ... ``` Use `--num_gpus` to select how many GPUs to use. ```bash deepspeed --num_gpus 2 trainer-program.py ... ``` ### Order of GPUs Now, to select which GPUs to use and their order, you'll use the `CUDA_VISIBLE_DEVICES` environment variable. It is easiest to set the environment variable in a `~/bashrc` or another startup config file. `CUDA_VISIBLE_DEVICES` is used to map which GPUs are used. For example, if you have 4 GPUs (0, 1, 2, 3) and you only want to run GPUs 0 and 2: ```bash CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ... ``` Only the 2 physical GPUs (0 and 2) are "visible" to PyTorch and these are mapped to `cuda:0` and `cuda:1` respectively. You can also reverse the order of the GPUs to use 2 first. Now, the mapping is `cuda:1` for GPU 0 and `cuda:0` for GPU 2. ```bash CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ... ``` You can also set the `CUDA_VISIBLE_DEVICES` environment variable to an empty value to create an environment without GPUs. ```bash CUDA_VISIBLE_DEVICES= python trainer-program.py ... ``` As with any environment variable, they can be exported instead of being added to the command line. However, this is not recommended because it can be confusing if you forget how the environment variable was setup and you end up using the wrong GPUs. Instead, it is common practice to set the environment variable for a specific training run on the same command line. `CUDA_DEVICE_ORDER` is an alternative environment variable you can use to control how the GPUs are ordered. You can either order them by: 1. PCIe bus ID's that matches the order of [`nvidia-smi`](https://developer.nvidia.com/nvidia-system-management-interface) and [`rocm-smi`](https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/.doxygen/docBin/html/index.html) for NVIDIA and AMD GPUs respectively ```bash export CUDA_DEVICE_ORDER=PCI_BUS_ID ``` 2. GPU compute ability ```bash export CUDA_DEVICE_ORDER=FASTEST_FIRST ``` The `CUDA_DEVICE_ORDER` is especially useful if your training setup consists of an older and newer GPU, where the older GPU appears first, but you cannot physically swap the cards to make the newer GPU appear first. In this case, set `CUDA_DEVICE_ORDER=FASTEST_FIRST` to always use the newer and faster GPU first (`nvidia-smi` or `rocm-smi` still reports the GPUs in their PCIe order). Or you could also set `export CUDA_VISIBLE_DEVICES=1,0`. # Training on TPU with TensorFlow If you don't need long explanations and just want TPU code samples to get started with, check out [our TPU example notebook!](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb) ### What is a TPU? A TPU is a **Tensor Processing Unit.** They are hardware designed by Google, which are used to greatly speed up the tensor computations within neural networks, much like GPUs. They can be used for both network training and inference. They are generally accessed through Googleโ€™s cloud services, but small TPUs can also be accessed directly for free through Google Colab and Kaggle Kernels. Because [all TensorFlow models in ๐Ÿค— Transformers are Keras models](https://huggingface.co/blog/tensorflow-philosophy), most of the methods in this document are generally applicable to TPU training for any Keras model! However, there are a few points that are specific to the HuggingFace ecosystem (hug-o-system?) of Transformers and Datasets, and weโ€™ll make sure to flag them up when we get to them. ### What kinds of TPU are available? New users are often very confused by the range of TPUs, and the different ways to access them. The first key distinction to understand is the difference between **TPU Nodes** and **TPU VMs.** When you use a **TPU Node**, you are effectively indirectly accessing a remote TPU. You will need a separate VM, which will initialize your network and data pipeline and then forward them to the remote node. When you use a TPU on Google Colab, you are accessing it in the **TPU Node** style. Using TPU Nodes can have some quite unexpected behaviour for people who arenโ€™t used to them! In particular, because the TPU is located on a physically different system to the machine youโ€™re running your Python code on, your data cannot be local to your machine - any data pipeline that loads from your machineโ€™s internal storage will totally fail! Instead, data must be stored in Google Cloud Storage where your data pipeline can still access it, even when the pipeline is running on the remote TPU node. If you can fit all your data in memory as `np.ndarray` or `tf.Tensor`, then you can `fit()` on that data even when using Colab or a TPU Node, without needing to upload it to Google Cloud Storage. **๐Ÿค—Specific Hugging Face Tip๐Ÿค—:** The methods `Dataset.to_tf_dataset()` and its higher-level wrapper `model.prepare_tf_dataset()` , which you will see throughout our TF code examples, will both fail on a TPU Node. The reason for this is that even though they create a `tf.data.Dataset` it is not a โ€œpureโ€ `tf.data` pipeline and uses `tf.numpy_function` or `Dataset.from_generator()` to stream data from the underlying HuggingFace `Dataset`. This HuggingFace `Dataset` is backed by data that is on a local disc and which the remote TPU Node will not be able to read. The second way to access a TPU is via a **TPU VM.** When using a TPU VM, you connect directly to the machine that the TPU is attached to, much like training on a GPU VM. TPU VMs are generally easier to work with, particularly when it comes to your data pipeline. All of the above warnings do not apply to TPU VMs! This is an opinionated document, so hereโ€™s our opinion: **Avoid using TPU Node if possible.** It is more confusing and more difficult to debug than TPU VMs. It is also likely to be unsupported in future - Googleโ€™s latest TPU, TPUv4, can only be accessed as a TPU VM, which suggests that TPU Nodes are increasingly going to become a โ€œlegacyโ€ access method. However, we understand that the only free TPU access is on Colab and Kaggle Kernels, which uses TPU Node - so weโ€™ll try to explain how to handle it if you have to! Check the [TPU example notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb) for code samples that explain this in more detail. ### What sizes of TPU are available? A single TPU (a v2-8/v3-8/v4-8) runs 8 replicas. TPUs exist in **pods** that can run hundreds or thousands of replicas simultaneously. When you use more than a single TPU but less than a whole pod (for example, a v3-32), your TPU fleet is referred to as a **pod slice.** When you access a free TPU via Colab, you generally get a single v2-8 TPU. ### I keep hearing about this XLA thing. Whatโ€™s XLA, and how does it relate to TPUs? XLA is an optimizing compiler, used by both TensorFlow and JAX. In JAX it is the only compiler, whereas in TensorFlow it is optional (but mandatory on TPU!). The easiest way to enable it when training a Keras model is to pass the argument `jit_compile=True` to `model.compile()`. If you donโ€™t get any errors and performance is good, thatโ€™s a great sign that youโ€™re ready to move to TPU! Debugging on TPU is generally a bit harder than on CPU/GPU, so we recommend getting your code running on CPU/GPU with XLA first before trying it on TPU. You donโ€™t have to train for long, of course - just for a few steps to make sure that your model and data pipeline are working like you expect them to. XLA compiled code is usually faster - so even if youโ€™re not planning to run on TPU, adding `jit_compile=True` can improve your performance. Be sure to note the caveats below about XLA compatibility, though! **Tip born of painful experience:** Although using `jit_compile=True` is a good way to get a speed boost and test if your CPU/GPU code is XLA-compatible, it can actually cause a lot of problems if you leave it in when actually training on TPU. XLA compilation will happen implicitly on TPU, so remember to remove that line before actually running your code on a TPU! ### How do I make my model XLA compatible? In many cases, your code is probably XLA-compatible already! However, there are a few things that work in normal TensorFlow that donโ€™t work in XLA. Weโ€™ve distilled them into three core rules below: **๐Ÿค—Specific HuggingFace Tip๐Ÿค—:** Weโ€™ve put a lot of effort into rewriting our TensorFlow models and loss functions to be XLA-compatible. Our models and loss functions generally obey rule #1 and #2 by default, so you can skip over them if youโ€™re using `transformers` models. Donโ€™t forget about these rules when writing your own models and loss functions, though! #### XLA Rule #1: Your code cannot have โ€œdata-dependent conditionalsโ€ What that means is that any `if` statement cannot depend on values inside a `tf.Tensor`. For example, this code block cannot be compiled with XLA! ```python if tf.reduce_sum(tensor) > 10: tensor = tensor / 2.0 ``` This might seem very restrictive at first, but most neural net code doesnโ€™t need to do this. You can often get around this restriction by using `tf.cond` (see the documentation [here](https://www.tensorflow.org/api_docs/python/tf/cond)) or by removing the conditional and finding a clever math trick with indicator variables instead, like so: ```python sum_over_10 = tf.cast(tf.reduce_sum(tensor) > 10, tf.float32) tensor = tensor / (1.0 + sum_over_10) ``` This code has exactly the same effect as the code above, but by avoiding a conditional, we ensure it will compile with XLA without problems! #### XLA Rule #2: Your code cannot have โ€œdata-dependent shapesโ€ What this means is that the shape of all of the `tf.Tensor` objects in your code cannot depend on their values. For example, the function `tf.unique` cannot be compiled with XLA, because it returns a `tensor` containing one instance of each unique value in the input. The shape of this output will obviously be different depending on how repetitive the input `Tensor` was, and so XLA refuses to handle it! In general, most neural network code obeys rule #2 by default. However, there are a few common cases where it becomes a problem. One very common one is when you use **label masking**, setting your labels to a negative value to indicate that those positions should be ignored when computing the loss. If you look at NumPy or PyTorch loss functions that support label masking, you will often see code like this that uses [boolean indexing](https://numpy.org/doc/stable/user/basics.indexing.html#boolean-array-indexing): ```python label_mask = labels >= 0 masked_outputs = outputs[label_mask] masked_labels = labels[label_mask] loss = compute_loss(masked_outputs, masked_labels) mean_loss = torch.mean(loss) ``` This code is totally fine in NumPy or PyTorch, but it breaks in XLA! Why? Because the shape of `masked_outputs` and `masked_labels` depends on how many positions are masked - that makes it a **data-dependent shape.** However, just like for rule #1, we can often rewrite this code to yield exactly the same output without any data-dependent shapes. ```python label_mask = tf.cast(labels >= 0, tf.float32) loss = compute_loss(outputs, labels) loss = loss * label_mask # Set negative label positions to 0 mean_loss = tf.reduce_sum(loss) / tf.reduce_sum(label_mask) ``` Here, we avoid data-dependent shapes by computing the loss for every position, but zeroing out the masked positions in both the numerator and denominator when we calculate the mean, which yields exactly the same result as the first block while maintaining XLA compatibility. Note that we use the same trick as in rule #1 - converting a `tf.bool` to `tf.float32` and using it as an indicator variable. This is a really useful trick, so remember it if you need to convert your own code to XLA! #### XLA Rule #3: XLA will need to recompile your model for every different input shape it sees This is the big one. What this means is that if your input shapes are very variable, XLA will have to recompile your model over and over, which will create huge performance problems. This commonly arises in NLP models, where input texts have variable lengths after tokenization. In other modalities, static shapes are more common and this rule is much less of a problem. How can you get around rule #3? The key is **padding** - if you pad all your inputs to the same length, and then use an `attention_mask`, you can get the same results as youโ€™d get from variable shapes, but without any XLA issues. However, excessive padding can cause severe slowdown too - if you pad all your samples to the maximum length in the whole dataset, you might end up with batches consisting endless padding tokens, which will waste a lot of compute and memory! There isnโ€™t a perfect solution to this problem. However, you can try some tricks. One very useful trick is to **pad batches of samples up to a multiple of a number like 32 or 64 tokens.** This often only increases the number of tokens by a small amount, but it hugely reduces the number of unique input shapes, because every input shape now has to be a multiple of 32 or 64. Fewer unique input shapes means fewer XLA compilations! **๐Ÿค—Specific HuggingFace Tip๐Ÿค—:** Our tokenizers and data collators have methods that can help you here. You can use `padding="max_length"` or `padding="longest"` when calling tokenizers to get them to output padded data. Our tokenizers and data collators also have a `pad_to_multiple_of` argument that you can use to reduce the number of unique input shapes you see! ### How do I actually train my model on TPU? Once your training is XLA-compatible and (if youโ€™re using TPU Node / Colab) your dataset has been prepared appropriately, running on TPU is surprisingly easy! All you really need to change in your code is to add a few lines to initialize your TPU, and to ensure that your model and dataset are created inside a `TPUStrategy` scope. Take a look at [our TPU example notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb) to see this in action! ### Summary There was a lot in here, so letโ€™s summarize with a quick checklist you can follow when you want to get your model ready for TPU training: - Make sure your code follows the three rules of XLA - Compile your model with `jit_compile=True` on CPU/GPU and confirm that you can train it with XLA - Either load your dataset into memory or use a TPU-compatible dataset loading approach (see [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb)) - Migrate your code either to Colab (with accelerator set to โ€œTPUโ€) or a TPU VM on Google Cloud - Add TPU initializer code (see [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb)) - Create your `TPUStrategy` and make sure dataset loading and model creation are inside the `strategy.scope()` (see [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb)) - Donโ€™t forget to take `jit_compile=True` out again when you move to TPU! - ๐Ÿ™๐Ÿ™๐Ÿ™๐Ÿฅบ๐Ÿฅบ๐Ÿฅบ - Call `model.fit()` - You did it! # LLM inference optimization Large language models (LLMs) have pushed text generation applications, such as chat and code completion models, to the next level by producing text that displays a high level of understanding and fluency. But what makes LLMs so powerful - namely their size - also presents challenges for inference. Basic inference is slow because LLMs have to be called repeatedly to generate the next token. The input sequence increases as generation progresses, which takes longer and longer for the LLM to process. LLMs also have billions of parameters, making it a challenge to store and handle all those weights in memory. This guide will show you how to use the optimization techniques available in Transformers to accelerate LLM inference. > [!TIP] > Hugging Face also provides [Text Generation Inference (TGI)](https://hf.co/docs/text-generation-inference), a library dedicated to deploying and serving highly optimized LLMs for inference. It includes deployment-oriented optimization features not included in Transformers, such as continuous batching for increasing throughput and tensor parallelism for multi-GPU inference. ## Static kv-cache and `torch.compile` During decoding, a LLM computes the key-value (kv) values for each input token and since it is autoregressive, it computes the same kv values each time because the generated output becomes part of the input now. This is not very efficient because you're recomputing the same kv values each time. To optimize this, you can use a kv-cache to store the past keys and values instead of recomputing them each time. However, since the kv-cache grows with each generation step and is dynamic, it prevents you from taking advantage of [`torch.compile`](./perf_torch_compile), a powerful optimization tool that fuses PyTorch code into fast and optimized kernels. We have an entire guide dedicated to kv-caches [here](./kv_cache). The *static kv-cache* solves this issue by pre-allocating the kv-cache size to a maximum value which allows you to combine it with `torch.compile` for up to a 4x speed up. Your speed up may vary depending on the model size (larger models have a smaller speed up) and hardware. > [!WARNING] > Currently, only [Llama](./model_doc/llama2) and a few other models support static kv-cache and `torch.compile`. Check [this issue](https://github.com/huggingface/transformers/issues/28981) for a live model compatibility list. There are three flavors of static kv-cache usage, depending on the complexity of your task: 1. Basic usage: simply set a flag in `generation_config` (recommended); 2. Advanced usage: handle a cache object for multi-turn generation or a custom generation loop; 3. Advanced usage: compile the entire `generate` function into a single graph, if having a single graph is relevant for you. Select the correct tab below for further instructions on each of these flavors. > [!TIP] > Regardless of the strategy used with `torch.compile`, you can avoid shape-related recompilations if you left-pad your LLM inputs to a limited set of values. The [`pad_to_multiple_of` tokenizer flag](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__.pad_to_multiple_of) is your friend! For this example, let's use the [Gemma](https://hf.co/google/gemma-2b) model. All we need to do is to: 1. Access the model's `generation_config` attribute and set the `cache_implementation` to "static"; 2. Call `torch.compile` on the model to compile the forward pass with the static kv-cache. And that's it! ```py from transformers import AutoTokenizer, AutoModelForCausalLM import torch import os os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :) tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b") model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto") model.generation_config.cache_implementation = "static" model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True) input_text = "The theory of special relativity states " input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") outputs = model.generate(**input_ids) print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) ['The theory of special relativity states 1. The speed of light is constant in all inertial reference'] ``` Under the hood, `generate` will attempt to reuse the same cache object, removing the need for re-compilation at each call. Avoiding re-compilation is critical to get the most out of `torch.compile`, and you should be aware of the following: 1. If the batch size changes or the maximum output length increases between calls, the cache will have to be reinitialized, triggering a new compilation; 2. The first couple of calls of the compiled function are slower, as the function is being compiled. > [!WARNING] > For a more advanced usage of the static cache, such as multi-turn conversations, we recommend instantiating and manipulating the cache object outside `generate()`. See the advanced usage tab. A `StaticCache` object can be passed to the model's `generate()` under the `past_key_values` argument. The object will retain the cache contents, so you can pass it to a new `generate()` call to continue generation, like you would do with a dynamic cache. ```py from transformers import AutoTokenizer, AutoModelForCausalLM, StaticCache import torch import os os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :) tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b") model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto") model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True) input_text = "The theory of special relativity states " input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") prompt_length = input_ids.input_ids.shape[1] model.generation_config.max_new_tokens = 16 past_key_values = StaticCache( config=model.config, batch_size=1, # If you plan to reuse the cache, make sure the cache length is large enough for all cases max_cache_len=prompt_length+(model.generation_config.max_new_tokens*2), device=model.device, dtype=model.dtype ) outputs = model.generate(**input_ids, past_key_values=past_key_values) print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) ['The theory of special relativity states 1. The speed of light is constant in all inertial reference frames. 2'] # pass in the generated text and the same cache object to continue generation from where it left off. Optionally, in a # multi-turn conversation, append the new user input to the generated text. new_input_ids = outputs outputs = model.generate(new_input_ids, past_key_values=past_key_values) print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) ['The theory of special relativity states 1. The speed of light is constant in all inertial reference frames. 2. The speed of light is constant in all inertial reference frames. 3.'] ``` > [!TIP] > If you want to reuse the same `StaticCache` object on a new prompt, be sure to reset its contents with the `.reset()` method between calls If you want to go further down a level, the `StaticCache` object can also be passed to the model's forward pass under the same `past_key_values` argument. Using this strategy, you can write your own function to decode the next token given the current token and position and cache position of previously generated tokens. ```py from transformers import LlamaTokenizer, LlamaForCausalLM, StaticCache, logging from transformers.testing_utils import CaptureLogger import torch prompts = [ "Simply put, the theory of relativity states that ", "My favorite all time favorite condiment is ketchup.", ] NUM_TOKENS_TO_GENERATE = 40 torch_device = "cuda" tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", pad_token="
", padding_side="right") model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="sequential") inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device) def decode_one_tokens(model, cur_token, input_pos, cache_position, past_key_values): logits = model( cur_token, position_ids=input_pos, cache_position=cache_position, past_key_values=past_key_values, return_dict=False, use_cache=True )[0] new_token = torch.argmax(logits[:, -1], dim=-1)[:, None] return new_token ``` There are a few important things you must do to enable static kv-cache and `torch.compile` with the `StaticCache` method: 1. Initialize the `StaticCache` instance before using the model for inference. There you can configure parameters like the maximum batch size and sequence length. 2. Call `torch.compile` on the model to compile the forward pass with the static kv-cache. 3. Set `enable_math=True` in the [torch.backends.cuda.sdp_kernel](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) context manager to enable the native PyTorch C++ implementation of scaled dot product attention to speed up inference even more. ```py batch_size, seq_length = inputs["input_ids"].shape with torch.no_grad(): past_key_values = StaticCache( config=model.config, batch_size=2, max_cache_len=4096, device=torch_device, dtype=model.dtype ) cache_position = torch.arange(seq_length, device=torch_device) generated_ids = torch.zeros( batch_size, seq_length + NUM_TOKENS_TO_GENERATE + 1, dtype=torch.int, device=torch_device ) generated_ids[:, cache_position] = inputs["input_ids"].to(torch_device).to(torch.int) logits = model( **inputs, cache_position=cache_position, past_key_values=past_key_values,return_dict=False, use_cache=True )[0] next_token = torch.argmax(logits[:, -1], dim=-1)[:, None] generated_ids[:, seq_length] = next_token[:, 0] decode_one_tokens = torch.compile(decode_one_tokens, mode="reduce-overhead", fullgraph=True) cache_position = torch.tensor([seq_length + 1], device=torch_device) for _ in range(1, NUM_TOKENS_TO_GENERATE): with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_mem_efficient=False, enable_math=True): next_token = decode_one_tokens(model, next_token.clone(), None, cache_position, past_key_values) generated_ids[:, cache_position] = next_token.int() cache_position += 1 text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True) text ['Simply put, the theory of relativity states that 1) the speed of light is constant, 2) the speed of light is the same for all observers, and 3) the laws of physics are the same for all observers.', 'My favorite all time favorite condiment is ketchup. I love it on everything. I love it on my eggs, my fries, my chicken, my burgers, my hot dogs, my sandwiches, my salads, my p'] ``` Compiling the entire `generate` function, in terms of code, is even simpler than in the basic usage: call `torch.compile` on `generate` to compile the entire function. No need to specify the use of the static cache: although it is compatible, dynamic cache (default) was faster in our benchmarks. ```py from transformers import AutoTokenizer, AutoModelForCausalLM import torch import os os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :) tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b") model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto") model.generate = torch.compile(model.generate, mode="reduce-overhead", fullgraph=True) input_text = "The theory of special relativity states " input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") outputs = model.generate(**input_ids) print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) ['The theory of special relativity states 1. The speed of light is constant in all inertial reference'] ``` As a result, we compile not only the model forward pass, but also all input preparation, logit processor operations, and so on. The result should be a slightly `generate` call, compared to the basic usage example, and the compiled graph may be better suited to more exotic hardware devices or use cases. However, there are severe drawbacks in using this approach: 1. Compilation is much slower; 2. All parameterization of `generate` must be done through `generation_config`; 3. Many warnings and exceptions are suppressed -- we suggest testing with its uncompiled form first; 4. Although we are working on it, it is heavily feature restricted (for instance, at the time of writing, generation does not stop if an EOS token is selected). ## Speculative decoding > [!TIP] > For a more in-depth explanation, take a look at the [Assisted Generation: a new direction toward low-latency text generation](https://hf.co/blog/assisted-generation) blog post! Another issue with autoregression is that for each input token you need to load the model weights each time during the forward pass. This is slow and cumbersome for LLMs which have billions of parameters. Speculative decoding alleviates this slowdown by using a second smaller and faster assistant model to generate candidate tokens that are verified by the larger LLM in a single forward pass. If the verified tokens are correct, the LLM essentially gets them for "free" without having to generate them itself. There is no degradation in accuracy because the verification forward pass ensures the same outputs are generated as if the LLM had generated them on its own. To get the largest speed up, the assistant model should be a lot smaller than the LLM so that it can generate tokens quickly. The assistant and LLM model must also share the same tokenizer to avoid re-encoding and decoding tokens. > [!WARNING] > Speculative decoding is only supported for the greedy search and sampling decoding strategies, and it also doesn't support batched inputs. Enable speculative decoding by loading an assistant model and passing it to the `generate()` method. ```py from transformers import AutoModelForCausalLM, AutoTokenizer import torch device = "cuda" if torch.cuda.is_available() else "cpu" tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b") inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(device) model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device) assistant_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device) outputs = model.generate(**inputs, assistant_model=assistant_model) tokenizer.batch_decode(outputs, skip_special_tokens=True) ["Einstein's theory of relativity states that the speed of light is constant. "] ``` For speculative sampling decoding, add the `do_sample` and `temperature` parameters to the `generate()` method in addition to the assistant model. ```py from transformers import AutoModelForCausalLM, AutoTokenizer import torch device = "cuda" if torch.cuda.is_available() else "cpu" tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b") inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(device) model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device) assistant_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device) outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.7) print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) ["Einstein's theory of relativity states that motion in the universe is not a straight line.\n"] ``` ### Prompt lookup decoding Prompt lookup decoding is a variant of speculative decoding that is also compatible with greedy search and sampling. Prompt lookup works especially well for input-grounded tasks - such as summarization - where there is often overlapping words between the prompt and output. These overlapping n-grams are used as the LLM candidate tokens. To enable prompt lookup decoding, specify the number of tokens that should be overlapping in the `prompt_lookup_num_tokens` parameter. Then you can pass this parameter to the `generate()` method. ```py from transformers import AutoModelForCausalLM, AutoTokenizer import torch device = "cuda" if torch.cuda.is_available() else "cpu" tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b") inputs = tokenizer("The second law of thermodynamics states", return_tensors="pt").to(device) model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device) assistant_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device) outputs = model.generate(**inputs, prompt_lookup_num_tokens=3) print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) ['The second law of thermodynamics states that entropy increases with temperature. '] ``` For prompt lookup decoding with sampling, add the `do_sample` and `temperature` parameters to the `generate()` method. ```py from transformers import AutoModelForCausalLM, AutoTokenizer import torch device = "cuda" if torch.cuda.is_available() else "cpu" tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b") inputs = tokenizer("The second law of thermodynamics states", return_tensors="pt").to(device) model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device) outputs = model.generate(**inputs, prompt_lookup_num_tokens=3, do_sample=True, temperature=0.7) print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) ["The second law of thermodynamics states that energy cannot be created nor destroyed. It's not a"] ``` ## Attention optimizations A known issue with transformer models is that the self-attention mechanism grows quadratically in compute and memory with the number of input tokens. This limitation is only magnified in LLMs which handles much longer sequences. To address this, try FlashAttention2 or PyTorch's scaled dot product attention (SDPA), which are more memory efficient attention implementations and can accelerate inference. ### FlashAttention-2 FlashAttention and [FlashAttention-2](./perf_infer_gpu_one#flashattention-2) break up the attention computation into smaller chunks and reduces the number of intermediate read/write operations to GPU memory to speed up inference. FlashAttention-2 improves on the original FlashAttention algorithm by also parallelizing over sequence length dimension and better partitioning work on the hardware to reduce synchronization and communication overhead. To use FlashAttention-2, set `attn_implementation="flash_attention_2"` in the `from_pretrained()` method. ```py from transformers import AutoModelForCausalLM, BitsAndBytesConfig quant_config = BitsAndBytesConfig(load_in_8bit=True) model = AutoModelForCausalLM.from_pretrained( "google/gemma-2b", quantization_config=quant_config, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2", ) ``` ### Fine-Tuning with torch.compile and Padding-Free Data Collation In addition to optimizing inference, you can also enhance the training efficiency of large language models by leveraging torch.compile during fine-tuning and using a padding-free data collator. This approach can significantly speed up training and reduce computational overhead. Here's how you can fine-tune a Llama model using SFTTrainer from the TRL library, with torch_compile enabled and a padding-free data collator: ``` #################### IMPORTS ################### import math import datasets import dataclasses from transformers import ( AutoModelForCausalLM, AutoTokenizer, TrainingArguments ) from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM #################### MODEL LOADING WITH FLASH ATTENTION ################### model_name = "meta-llama/Llama-3.2-1B" model = AutoModelForCausalLM.from_pretrained( model_name, attn_implementation="flash_attention_2" # Enables FlashAttention-2 ) tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True) #################### DATA PREPROCESSING (PADDING-FREE) ################### response_template = "\n### Label:" response_template_ids = tokenizer.encode( response_template, add_special_tokens=False )[2:] # Exclude special tokens data_collator = DataCollatorForCompletionOnlyLM( response_template_ids=response_template_ids, tokenizer=tokenizer, ignore_index=-100, padding_free=True # Enables padding-free collation ) def format_dataset(example): return { "output": example["output"] + tokenizer.eos_token } data_files = {"train": "path/to/dataset"} # Replace with your dataset path json_dataset = datasets.load_dataset("json", data_files=data_files) formatted_train_dataset = json_dataset["train"].map(format_dataset) ################# TRAINING CONFIGURATION ############################ train_args = TrainingArguments( num_train_epochs=5, per_device_train_batch_size=4, per_device_eval_batch_size=4, gradient_accumulation_steps=4, learning_rate=1e-5, weight_decay=0.0, warmup_ratio=0.03, lr_scheduler_type="cosine", logging_steps=1, include_tokens_per_second=True, save_strategy="epoch", output_dir="output", torch_compile=True, # Enables torch.compile torch_compile_backend="inductor", torch_compile_mode="default" ) # Convert TrainingArguments to SFTConfig transformer_train_arg_fields = [x.name for x in dataclasses.fields(SFTConfig)] transformer_kwargs = { k: v for k, v in train_args.to_dict().items() if k in transformer_train_arg_fields } training_args = SFTConfig(**transformer_kwargs) ####################### FINE-TUNING ##################### trainer = SFTTrainer( model=model, tokenizer=tokenizer, train_dataset=formatted_train_dataset, data_collator=data_collator, dataset_text_field="output", args=training_args, ) trainer.train() ``` ### PyTorch scaled dot product attention Scaled dot product attention (SDPA) is automatically enabled in PyTorch 2.0 and it supports FlashAttention, xFormers, and PyTorch's C++ implementation. SDPA chooses the most performant attention algorithm if you're using a CUDA backend. For other backends, SDPA defaults to the PyTorch C++ implementation. > [!TIP] > SDPA supports FlashAttention-2 as long as you have the latest PyTorch version installed. Use the [torch.backends.cuda.sdp_kernel](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) context manager to explicitly enable or disable any of the three attention algorithms. For example, set `enable_flash=True` to enable FlashAttention. ```py import torch from transformers import AutoModelForCausalLM model = AutoModelForCausalLM.from_pretrained( "google/gemma-2b", torch_dtype=torch.bfloat16, ) with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False): outputs = model.generate(**inputs) ``` ## Quantization Quantization reduces the size of the LLM weights by storing them in a lower precision. This translates to lower memory usage and makes loading LLMs for inference more accessible if you're constrained by your GPUs memory. If you aren't limited by your GPU, you don't necessarily need to quantize your model because it can incur a small latency cost (except for AWQ and fused AWQ modules) due to the extra step required to quantize and dequantize the weights. > [!TIP] > There are many quantization libraries (see the [Quantization](./quantization) guide for more details) available, such as Quanto, AQLM, AWQ, and AutoGPTQ. Feel free to try them out and see which one works best for your use case. We also recommend reading the [Overview of natively supported quantization schemes in ๐Ÿค— Transformers](https://hf.co/blog/overview-quantization-transformers) blog post which compares AutoGPTQ and bitsandbytes. Use the Model Memory Calculator below to estimate and compare how much memory is required to load a model. For example, try estimating how much memory it costs to load [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1). To load Mistral-7B-v0.1 in half-precision, set the `torch_dtype` parameter in the `from_pretrained()` method to `torch.bfloat16`. This requires 13.74GB of memory. ```py from transformers import AutoTokenizer, AutoModelForCausalLM import torch model = AutoModelForCausalLM.from_pretrained( "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16, device_map="auto", ) ``` To load a quantized model (8-bit or 4-bit) for inference, try [bitsandbytes](https://hf.co/docs/bitsandbytes) and set the `load_in_4bit` or `load_in_8bit` parameters to `True`. Loading the model in 8-bits only requires 6.87 GB of memory. ```py from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig import torch quant_config = BitsAndBytesConfig(load_in_8bit=True) model = AutoModelForCausalLM.from_pretrained( "mistralai/Mistral-7B-v0.1", quantization_config=quant_config, device_map="auto" ) ``` # Optimize inference using torch.compile() This guide aims to provide a benchmark on the inference speed-ups introduced with [`torch.compile()`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html)ย for [computer vision models in ๐Ÿค— Transformers](https://huggingface.co/models?pipeline_tag=image-classification&library=transformers&sort=trending). ## Benefits of torch.compile Depending on the model and the GPU, `torch.compile()` yields up to 30% speed-up during inference. To use `torch.compile()`, simply install any version of `torch` above 2.0. Compiling a model takes time, so it's useful if you are compiling the model only once instead of every time you infer. To compile any computer vision model of your choice, call `torch.compile()` on the model as shown below: ```diff from transformers import AutoModelForImageClassification model = AutoModelForImageClassification.from_pretrained(MODEL_ID).to("cuda") + model = torch.compile(model) ``` `compile()`ย comes with multiple modes for compiling, which essentially differ in compilation time and inference overhead. `max-autotune`ย takes longer than `reduce-overhead`ย but results in faster inference. Default mode is fastest for compilation but is not as efficient compared to `reduce-overhead` for inference time. In this guide, we used the default mode. You can learn more about it [here](https://pytorch.org/get-started/pytorch-2.0/#user-experience). We benchmarked `torch.compile` with different computer vision models, tasks, types of hardware, and batch sizes on `torch`ย version 2.0.1. ## Benchmarking code Below you can find the benchmarking code for each task. We warm up the GPU before inference and take the mean time of 300 inferences, using the same image each time. ### Image Classification with ViT ```python import torch from PIL import Image import requests import numpy as np from transformers import AutoImageProcessor, AutoModelForImageClassification url = 'http://images.cocodataset.org/val2017/000000039769.jpg' image = Image.open(requests.get(url, stream=True).raw) processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224") model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224").to("cuda") model = torch.compile(model) processed_input = processor(image, return_tensors='pt').to(device="cuda") with torch.no_grad(): _ = model(**processed_input) ``` #### Object Detection with DETR ```python from transformers import AutoImageProcessor, AutoModelForObjectDetection processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50") model = AutoModelForObjectDetection.from_pretrained("facebook/detr-resnet-50").to("cuda") model = torch.compile(model) texts = ["a photo of a cat", "a photo of a dog"] inputs = processor(text=texts, images=image, return_tensors="pt").to("cuda") with torch.no_grad(): _ = model(**inputs) ``` #### Image Segmentation with Segformer ```python from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation processor = SegformerImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512") model = SegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512").to("cuda") model = torch.compile(model) seg_inputs = processor(images=image, return_tensors="pt").to("cuda") with torch.no_grad(): _ = model(**seg_inputs) ``` Below you can find the list of the models we benchmarked. **Image Classification** - [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) - [microsoft/beit-base-patch16-224-pt22k-ft22k](https://huggingface.co/microsoft/beit-base-patch16-224-pt22k-ft22k) - [facebook/convnext-large-224](https://huggingface.co/facebook/convnext-large-224) - [microsoft/resnet-50](https://huggingface.co/microsoft/resnet-50) **Image Segmentation** - [nvidia/segformer-b0-finetuned-ade-512-512](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512) - [facebook/mask2former-swin-tiny-coco-panoptic](https://huggingface.co/facebook/mask2former-swin-tiny-coco-panoptic) - [facebook/maskformer-swin-base-ade](https://huggingface.co/facebook/maskformer-swin-base-ade) - [google/deeplabv3_mobilenet_v2_1.0_513](https://huggingface.co/google/deeplabv3_mobilenet_v2_1.0_513) **Object Detection** - [google/owlvit-base-patch32](https://huggingface.co/google/owlvit-base-patch32) - [facebook/detr-resnet-101](https://huggingface.co/facebook/detr-resnet-101) - [microsoft/conditional-detr-resnet-50](https://huggingface.co/microsoft/conditional-detr-resnet-50) Below you can find visualization of inference durations with and without `torch.compile()`ย and percentage improvements for each model in different hardware and batch sizes.
![Duration Comparison on V100 with Batch Size of 1](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/torch_compile/v100_1_duration.png) ![Percentage Improvement on T4 with Batch Size of 4](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/torch_compile/T4_4_percentage.png) Below you can find inference durations in milliseconds for each model with and without `compile()`. Note that OwlViT results in OOM in larger batch sizes. ### A100 (batch size: 1) | **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | |:---:|:---:|:---:| | Image Classification/ViT | 9.325 | 7.584 | | Image Segmentation/Segformer | 11.759 | 10.500 | | Object Detection/OwlViT | 24.978 | 18.420 | | Image Classification/BeiT | 11.282 | 8.448 | | Object Detection/DETR | 34.619 | 19.040 | | Image Classification/ConvNeXT | 10.410 | 10.208 | | Image Classification/ResNet | 6.531 | 4.124 | | Image Segmentation/Mask2former | 60.188 | 49.117 | | Image Segmentation/Maskformer | 75.764 | 59.487 | | Image Segmentation/MobileNet | 8.583 | 3.974 | | Object Detection/Resnet-101 | 36.276 | 18.197 | | Object Detection/Conditional-DETR | 31.219 | 17.993 | ### A100 (batch size: 4) | **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | |:---:|:---:|:---:| | Image Classification/ViT | 14.832 | 14.499 | | Image Segmentation/Segformer | 18.838 | 16.476 | | Image Classification/BeiT | 13.205 | 13.048 | | Object Detection/DETR | 48.657 | 32.418| | Image Classification/ConvNeXT | 22.940 | 21.631 | | Image Classification/ResNet | 6.657 | 4.268 | | Image Segmentation/Mask2former | 74.277 | 61.781 | | Image Segmentation/Maskformer | 180.700 | 159.116 | | Image Segmentation/MobileNet | 14.174 | 8.515 | | Object Detection/Resnet-101 | 68.101 | 44.998 | | Object Detection/Conditional-DETR | 56.470 | 35.552 | ### A100 (batch size: 16) | **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | |:---:|:---:|:---:| | Image Classification/ViT | 40.944 | 40.010 | | Image Segmentation/Segformer | 37.005 | 31.144 | | Image Classification/BeiT | 41.854 | 41.048 | | Object Detection/DETR | 164.382 | 161.902 | | Image Classification/ConvNeXT | 82.258 | 75.561 | | Image Classification/ResNet | 7.018 | 5.024 | | Image Segmentation/Mask2former | 178.945 | 154.814 | | Image Segmentation/Maskformer | 638.570 | 579.826 | | Image Segmentation/MobileNet | 51.693 | 30.310 | | Object Detection/Resnet-101 | 232.887 | 155.021 | | Object Detection/Conditional-DETR | 180.491 | 124.032 | ### V100 (batch size: 1) | **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | |:---:|:---:|:---:| | Image Classification/ViT | 10.495 | 6.00 | | Image Segmentation/Segformer | 13.321 | 5.862 | | Object Detection/OwlViT | 25.769 | 22.395 | | Image Classification/BeiT | 11.347 | 7.234 | | Object Detection/DETR | 33.951 | 19.388 | | Image Classification/ConvNeXT | 11.623 | 10.412 | | Image Classification/ResNet | 6.484 | 3.820 | | Image Segmentation/Mask2former | 64.640 | 49.873 | | Image Segmentation/Maskformer | 95.532 | 72.207 | | Image Segmentation/MobileNet | 9.217 | 4.753 | | Object Detection/Resnet-101 | 52.818 | 28.367 | | Object Detection/Conditional-DETR | 39.512 | 20.816 | ### V100 (batch size: 4) | **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | |:---:|:---:|:---:| | Image Classification/ViT | 15.181 | 14.501 | | Image Segmentation/Segformer | 16.787 | 16.188 | | Image Classification/BeiT | 15.171 | 14.753 | | Object Detection/DETR | 88.529 | 64.195 | | Image Classification/ConvNeXT | 29.574 | 27.085 | | Image Classification/ResNet | 6.109 | 4.731 | | Image Segmentation/Mask2former | 90.402 | 76.926 | | Image Segmentation/Maskformer | 234.261 | 205.456 | | Image Segmentation/MobileNet | 24.623 | 14.816 | | Object Detection/Resnet-101 | 134.672 | 101.304 | | Object Detection/Conditional-DETR | 97.464 | 69.739 | ### V100 (batch size: 16) | **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | |:---:|:---:|:---:| | Image Classification/ViT | 52.209 | 51.633 | | Image Segmentation/Segformer | 61.013 | 55.499 | | Image Classification/BeiT | 53.938 | 53.581 | | Object Detection/DETR | OOM | OOM | | Image Classification/ConvNeXT | 109.682 | 100.771 | | Image Classification/ResNet | 14.857 | 12.089 | | Image Segmentation/Mask2former | 249.605 | 222.801 | | Image Segmentation/Maskformer | 831.142 | 743.645 | | Image Segmentation/MobileNet | 93.129 | 55.365 | | Object Detection/Resnet-101 | 482.425 | 361.843 | | Object Detection/Conditional-DETR | 344.661 | 255.298 | ### T4 (batch size: 1) | **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | |:---:|:---:|:---:| | Image Classification/ViT | 16.520 | 15.786 | | Image Segmentation/Segformer | 16.116 | 14.205 | | Object Detection/OwlViT | 53.634 | 51.105 | | Image Classification/BeiT | 16.464 | 15.710 | | Object Detection/DETR | 73.100 | 53.99 | | Image Classification/ConvNeXT | 32.932 | 30.845 | | Image Classification/ResNet | 6.031 | 4.321 | | Image Segmentation/Mask2former | 79.192 | 66.815 | | Image Segmentation/Maskformer | 200.026 | 188.268 | | Image Segmentation/MobileNet | 18.908 | 11.997 | | Object Detection/Resnet-101 | 106.622 | 82.566 | | Object Detection/Conditional-DETR | 77.594 | 56.984 | ### T4 (batch size: 4) | **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | |:---:|:---:|:---:| | Image Classification/ViT | 43.653 | 43.626 | | Image Segmentation/Segformer | 45.327 | 42.445 | | Image Classification/BeiT | 52.007 | 51.354 | | Object Detection/DETR | 277.850 | 268.003 | | Image Classification/ConvNeXT | 119.259 | 105.580 | | Image Classification/ResNet | 13.039 | 11.388 | | Image Segmentation/Mask2former | 201.540 | 184.670 | | Image Segmentation/Maskformer | 764.052 | 711.280 | | Image Segmentation/MobileNet | 74.289 | 48.677 | | Object Detection/Resnet-101 | 421.859 | 357.614 | | Object Detection/Conditional-DETR | 289.002 | 226.945 | ### T4 (batch size: 16) | **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | |:---:|:---:|:---:| | Image Classification/ViT | 163.914 | 160.907 | | Image Segmentation/Segformer | 192.412 | 163.620 | | Image Classification/BeiT | 188.978 | 187.976 | | Object Detection/DETR | OOM | OOM | | Image Classification/ConvNeXT | 422.886 | 388.078 | | Image Classification/ResNet | 44.114 | 37.604 | | Image Segmentation/Mask2former | 756.337 | 695.291 | | Image Segmentation/Maskformer | 2842.940 | 2656.88 | | Image Segmentation/MobileNet | 299.003 | 201.942 | | Object Detection/Resnet-101 | 1619.505 | 1262.758 | | Object Detection/Conditional-DETR | 1137.513 | 897.390| ## PyTorch Nightly We also benchmarked on PyTorch nightly (2.1.0dev, find the wheel [here](https://download.pytorch.org/whl/nightly/cu118)) and observed improvement in latency both for uncompiled and compiled models. ### A100 | **Task/Model** | **Batch Size** | **torch 2.0 - no compile** | **torch 2.0 -
compile** | |:---:|:---:|:---:|:---:| | Image Classification/BeiT | Unbatched | 12.462 | 6.954 | | Image Classification/BeiT | 4 | 14.109 | 12.851 | | Image Classification/BeiT | 16 | 42.179 | 42.147 | | Object Detection/DETR | Unbatched | 30.484 | 15.221 | | Object Detection/DETR | 4 | 46.816 | 30.942 | | Object Detection/DETR | 16 | 163.749 | 163.706 | ### T4 | **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | |:---:|:---:|:---:|:---:| | Image Classification/BeiT | Unbatched | 14.408 | 14.052 | | Image Classification/BeiT | 4 | 47.381 | 46.604 | | Image Classification/BeiT | 16 | 42.179 | 42.147 | | Object Detection/DETR | Unbatched | 68.382 | 53.481 | | Object Detection/DETR | 4 | 269.615 | 204.785 | | Object Detection/DETR | 16 | OOM | OOM | ### V100 | **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | |:---:|:---:|:---:|:---:| | Image Classification/BeiT | Unbatched | 13.477 | 7.926 | | Image Classification/BeiT | 4 | 15.103 | 14.378 | | Image Classification/BeiT | 16 | 52.517 | 51.691 | | Object Detection/DETR | Unbatched | 28.706 | 19.077 | | Object Detection/DETR | 4 | 88.402 | 62.949| | Object Detection/DETR | 16 | OOM | OOM | ## Reduce Overhead We benchmarked `reduce-overhead` compilation mode for A100 and T4 in Nightly. ### A100 | **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | |:---:|:---:|:---:|:---:| | Image Classification/ConvNeXT | Unbatched | 11.758 | 7.335 | | Image Classification/ConvNeXT | 4 | 23.171 | 21.490 | | Image Classification/ResNet | Unbatched | 7.435 | 3.801 | | Image Classification/ResNet | 4 | 7.261 | 2.187 | | Object Detection/Conditional-DETR | Unbatched | 32.823 | 11.627 | | Object Detection/Conditional-DETR | 4 | 50.622 | 33.831 | | Image Segmentation/MobileNet | Unbatched | 9.869 | 4.244 | | Image Segmentation/MobileNet | 4 | 14.385 | 7.946 | ### T4 | **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | |:---:|:---:|:---:|:---:| | Image Classification/ConvNeXT | Unbatched | 32.137 | 31.84 | | Image Classification/ConvNeXT | 4 | 120.944 | 110.209 | | Image Classification/ResNet | Unbatched | 9.761 | 7.698 | | Image Classification/ResNet | 4 | 15.215 | 13.871 | | Object Detection/Conditional-DETR | Unbatched | 72.150 | 57.660 | | Object Detection/Conditional-DETR | 4 | 301.494 | 247.543 | | Image Segmentation/MobileNet | Unbatched | 22.266 | 19.339 | | Image Segmentation/MobileNet | 4 | 78.311 | 50.983 | # Benchmarks Hugging Face's Benchmarking tools are deprecated and it is advised to use external Benchmarking libraries to measure the speed and memory complexity of Transformer models. Let's take a look at how ๐Ÿค— Transformers models can be benchmarked, best practices, and already available benchmarks. A notebook explaining in more detail how to benchmark ๐Ÿค— Transformers models can be found [here](https://github.com/huggingface/notebooks/tree/main/examples/benchmark.ipynb). ## How to benchmark ๐Ÿค— Transformers models The classes `PyTorchBenchmark` and `TensorFlowBenchmark` allow to flexibly benchmark ๐Ÿค— Transformers models. The benchmark classes allow us to measure the _peak memory usage_ and _required time_ for both _inference_ and _training_. Here, _inference_ is defined by a single forward pass, and _training_ is defined by a single forward pass and backward pass. The benchmark classes `PyTorchBenchmark` and `TensorFlowBenchmark` expect an object of type `PyTorchBenchmarkArguments` and `TensorFlowBenchmarkArguments`, respectively, for instantiation. `PyTorchBenchmarkArguments` and `TensorFlowBenchmarkArguments` are data classes and contain all relevant configurations for their corresponding benchmark class. In the following example, it is shown how a BERT model of type _bert-base-cased_ can be benchmarked. ```py >>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments >>> args = PyTorchBenchmarkArguments(models=["google-bert/bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512]) >>> benchmark = PyTorchBenchmark(args) ``` Here, three arguments are given to the benchmark argument data classes, namely `models`, `batch_sizes`, and `sequence_lengths`. The argument `models` is required and expects a `list` of model identifiers from the [model hub](https://huggingface.co/models) The `list` arguments `batch_sizes` and `sequence_lengths` define the size of the `input_ids` on which the model is benchmarked. There are many more parameters that can be configured via the benchmark argument data classes. For more detail on these one can either directly consult the files `src/transformers/benchmark/benchmark_args_utils.py`, `src/transformers/benchmark/benchmark_args.py` (for PyTorch) and `src/transformers/benchmark/benchmark_args_tf.py` (for Tensorflow). Alternatively, running the following shell commands from root will print out a descriptive list of all configurable parameters for PyTorch and Tensorflow respectively. ```bash python examples/pytorch/benchmarking/run_benchmark.py --help ``` An instantiated benchmark object can then simply be run by calling `benchmark.run()`. ```py >>> results = benchmark.run() >>> print(results) ==================== INFERENCE - SPEED - RESULT ==================== -------------------------------------------------------------------------------- Model Name Batch Size Seq Length Time in s -------------------------------------------------------------------------------- google-bert/bert-base-uncased 8 8 0.006 google-bert/bert-base-uncased 8 32 0.006 google-bert/bert-base-uncased 8 128 0.018 google-bert/bert-base-uncased 8 512 0.088 -------------------------------------------------------------------------------- ==================== INFERENCE - MEMORY - RESULT ==================== -------------------------------------------------------------------------------- Model Name Batch Size Seq Length Memory in MB -------------------------------------------------------------------------------- google-bert/bert-base-uncased 8 8 1227 google-bert/bert-base-uncased 8 32 1281 google-bert/bert-base-uncased 8 128 1307 google-bert/bert-base-uncased 8 512 1539 -------------------------------------------------------------------------------- ==================== ENVIRONMENT INFORMATION ==================== - transformers_version: 2.11.0 - framework: PyTorch - use_torchscript: False - framework_version: 1.4.0 - python_version: 3.6.10 - system: Linux - cpu: x86_64 - architecture: 64bit - date: 2020-06-29 - time: 08:58:43.371351 - fp16: False - use_multiprocessing: True - only_pretrain_model: False - cpu_ram_mb: 32088 - use_gpu: True - num_gpus: 1 - gpu: TITAN RTX - gpu_ram_mb: 24217 - gpu_power_watts: 280.0 - gpu_performance_state: 2 - use_tpu: False ``` By default, the _time_ and the _required memory_ for _inference_ are benchmarked. In the example output above the first two sections show the result corresponding to _inference time_ and _inference memory_. In addition, all relevant information about the computing environment, _e.g._ the GPU type, the system, the library versions, etc... are printed out in the third section under _ENVIRONMENT INFORMATION_. This information can optionally be saved in a _.csv_ file when adding the argument `save_to_csv=True` to `PyTorchBenchmarkArguments` and `TensorFlowBenchmarkArguments` respectively. In this case, every section is saved in a separate _.csv_ file. The path to each _.csv_ file can optionally be defined via the argument data classes. Instead of benchmarking pre-trained models via their model identifier, _e.g._ `google-bert/bert-base-uncased`, the user can alternatively benchmark an arbitrary configuration of any available model class. In this case, a `list` of configurations must be inserted with the benchmark args as follows. ```py >>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments, BertConfig >>> args = PyTorchBenchmarkArguments( ... models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512] ... ) >>> config_base = BertConfig() >>> config_384_hid = BertConfig(hidden_size=384) >>> config_6_lay = BertConfig(num_hidden_layers=6) >>> benchmark = PyTorchBenchmark(args, configs=[config_base, config_384_hid, config_6_lay]) >>> benchmark.run() ==================== INFERENCE - SPEED - RESULT ==================== -------------------------------------------------------------------------------- Model Name Batch Size Seq Length Time in s -------------------------------------------------------------------------------- bert-base 8 128 0.006 bert-base 8 512 0.006 bert-base 8 128 0.018 bert-base 8 512 0.088 bert-384-hid 8 8 0.006 bert-384-hid 8 32 0.006 bert-384-hid 8 128 0.011 bert-384-hid 8 512 0.054 bert-6-lay 8 8 0.003 bert-6-lay 8 32 0.004 bert-6-lay 8 128 0.009 bert-6-lay 8 512 0.044 -------------------------------------------------------------------------------- ==================== INFERENCE - MEMORY - RESULT ==================== -------------------------------------------------------------------------------- Model Name Batch Size Seq Length Memory in MB -------------------------------------------------------------------------------- bert-base 8 8 1277 bert-base 8 32 1281 bert-base 8 128 1307 bert-base 8 512 1539 bert-384-hid 8 8 1005 bert-384-hid 8 32 1027 bert-384-hid 8 128 1035 bert-384-hid 8 512 1255 bert-6-lay 8 8 1097 bert-6-lay 8 32 1101 bert-6-lay 8 128 1127 bert-6-lay 8 512 1359 -------------------------------------------------------------------------------- ==================== ENVIRONMENT INFORMATION ==================== - transformers_version: 2.11.0 - framework: PyTorch - use_torchscript: False - framework_version: 1.4.0 - python_version: 3.6.10 - system: Linux - cpu: x86_64 - architecture: 64bit - date: 2020-06-29 - time: 09:35:25.143267 - fp16: False - use_multiprocessing: True - only_pretrain_model: False - cpu_ram_mb: 32088 - use_gpu: True - num_gpus: 1 - gpu: TITAN RTX - gpu_ram_mb: 24217 - gpu_power_watts: 280.0 - gpu_performance_state: 2 - use_tpu: False ``` Again, _inference time_ and _required memory_ for _inference_ are measured, but this time for customized configurations of the `BertModel` class. This feature can especially be helpful when deciding for which configuration the model should be trained. ## Benchmark best practices This section lists a couple of best practices one should be aware of when benchmarking a model. - Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user specifies on which device the code should be run by setting the `CUDA_VISIBLE_DEVICES` environment variable in the shell, _e.g._ `export CUDA_VISIBLE_DEVICES=0` before running the code. - The option `no_multi_processing` should only be set to `True` for testing and debugging. To ensure accurate memory measurement it is recommended to run each memory benchmark in a separate process by making sure `no_multi_processing` is set to `True`. - One should always state the environment information when sharing the results of a model benchmark. Results can vary heavily between different GPU devices, library versions, etc., as a consequence, benchmark results on their own are not very useful for the community. ## Sharing your benchmark Previously all available core models (10 at the time) have been benchmarked for _inference time_, across many different settings: using PyTorch, with and without TorchScript, using TensorFlow, with and without XLA. All of those tests were done across CPUs (except for TensorFlow XLA) and GPUs. The approach is detailed in the [following blogpost](https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2) and the results are available [here](https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing). With the new _benchmark_ tools, it is easier than ever to share your benchmark results with the community - [PyTorch Benchmarking Results](https://github.com/huggingface/transformers/tree/main/examples/pytorch/benchmarking/README.md). - [TensorFlow Benchmarking Results](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/benchmarking/README.md). # ๐Ÿค— Transformers Notebooks You can find here a list of the official notebooks provided by Hugging Face. Also, we would like to list here interesting content created by the community. If you wrote some notebook(s) leveraging ๐Ÿค— Transformers and would like to be listed here, please open a Pull Request so it can be included under the Community notebooks. ## Hugging Face's notebooks ๐Ÿค— ### Documentation notebooks You can open any page of the documentation as a notebook in Colab (there is a button directly on said pages) but they are also listed here if you need them: | Notebook | Description | | | |:----------|:-------------|:-------------|------:| | [Quicktour of the library](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/quicktour.ipynb) | A presentation of the various APIs in Transformers |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/quicktour.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/en/transformers_doc/quicktour.ipynb)| | [Summary of the tasks](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/task_summary.ipynb) | How to run the models of the Transformers library task by task |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/task_summary.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/task_summary.ipynb)| | [Preprocessing data](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/preprocessing.ipynb) | How to use a tokenizer to preprocess your data |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/preprocessing.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/preprocessing.ipynb)| | [Fine-tuning a pretrained model](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/training.ipynb) | How to use the Trainer to fine-tune a pretrained model |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/training.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/training.ipynb)| | [Summary of the tokenizers](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb) | The differences between the tokenizers algorithm |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb)| | [Multilingual models](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb) | How to use the multilingual models of the library |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb)| ### PyTorch Examples #### Natural Language Processing[[pytorch-nlp]] | Notebook | Description | | | |:----------|:-------------|:-------------|------:| | [Train your tokenizer](https://github.com/huggingface/notebooks/blob/main/examples/tokenizer_training.ipynb) | How to train and use your very own tokenizer |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tokenizer_training.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/tokenizer_training.ipynb)| | [Train your language model](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb) | How to easily start using transformers |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb)| | [How to fine-tune a model on text classification](https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on any GLUE task. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)| | [How to fine-tune a model on language modeling](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on a causal or masked LM task. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)| | [How to fine-tune a model on token classification](https://github.com/huggingface/notebooks/blob/main/examples/token_classification.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on a token classification task (NER, PoS). | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb)| | [How to fine-tune a model on question answering](https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on SQUAD. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb)| | [How to fine-tune a model on multiple choice](https://github.com/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on SWAG. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb)| | [How to fine-tune a model on translation](https://github.com/huggingface/notebooks/blob/main/examples/translation.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on WMT. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/translation.ipynb)| | [How to fine-tune a model on summarization](https://github.com/huggingface/notebooks/blob/main/examples/summarization.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on XSUM. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/summarization.ipynb)| | [How to train a language model from scratch](https://github.com/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb)| Highlight all the steps to effectively train Transformer model on custom data | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb)| | [How to generate text](https://github.com/huggingface/blog/blob/main/notebooks/02_how_to_generate.ipynb)| How to use different decoding methods for language generation with transformers | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/02_how_to_generate.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/blog/blob/main/notebooks/02_how_to_generate.ipynb)| | [How to generate text (with constraints)](https://github.com/huggingface/blog/blob/main/notebooks/53_constrained_beam_search.ipynb)| How to guide language generation with user-provided constraints | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/53_constrained_beam_search.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/blog/blob/main/notebooks/53_constrained_beam_search.ipynb)| | [Reformer](https://github.com/huggingface/blog/blob/main/notebooks/03_reformer.ipynb)| How Reformer pushes the limits of language modeling | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/blog/blob/main/notebooks/03_reformer.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/patrickvonplaten/blog/blob/main/notebooks/03_reformer.ipynb)| #### Computer Vision[[pytorch-cv]] | Notebook | Description | | | |:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------:| | [How to fine-tune a model on image classification (Torchvision)](https://github.com/huggingface/notebooks/blob/main/examples/image_classification.ipynb) | Show how to preprocess the data using Torchvision and fine-tune any pretrained Vision model on Image Classification | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)| | [How to fine-tune a model on image classification (Albumentations)](https://github.com/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) | Show how to preprocess the data using Albumentations and fine-tune any pretrained Vision model on Image Classification | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb)| | [How to fine-tune a model on image classification (Kornia)](https://github.com/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb) | Show how to preprocess the data using Kornia and fine-tune any pretrained Vision model on Image Classification | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb)| | [How to perform zero-shot object detection with OWL-ViT](https://github.com/huggingface/notebooks/blob/main/examples/zeroshot_object_detection_with_owlvit.ipynb) | Show how to perform zero-shot object detection on images with text queries | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/zeroshot_object_detection_with_owlvit.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/zeroshot_object_detection_with_owlvit.ipynb)| | [How to fine-tune an image captioning model](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) | Show how to fine-tune BLIP for image captioning on a custom dataset | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb)| | [How to build an image similarity system with Transformers](https://github.com/huggingface/notebooks/blob/main/examples/image_similarity.ipynb) | Show how to build an image similarity system | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_similarity.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/image_similarity.ipynb)| | [How to fine-tune a SegFormer model on semantic segmentation](https://github.com/huggingface/notebooks/blob/main/examples/semantic_segmentation.ipynb) | Show how to preprocess the data and fine-tune a pretrained SegFormer model on Semantic Segmentation | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/semantic_segmentation.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/semantic_segmentation.ipynb)| | [How to fine-tune a VideoMAE model on video classification](https://github.com/huggingface/notebooks/blob/main/examples/video_classification.ipynb) | Show how to preprocess the data and fine-tune a pretrained VideoMAE model on Video Classification | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/video_classification.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/video_classification.ipynb)| #### Audio[[pytorch-audio]] | Notebook | Description | | | |:----------|:-------------|:-------------|------:| | [How to fine-tune a speech recognition model in English](https://github.com/huggingface/notebooks/blob/main/examples/speech_recognition.ipynb)| Show how to preprocess the data and fine-tune a pretrained Speech model on TIMIT | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/speech_recognition.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/speech_recognition.ipynb)| | [How to fine-tune a speech recognition model in any language](https://github.com/huggingface/notebooks/blob/main/examples/multi_lingual_speech_recognition.ipynb)| Show how to preprocess the data and fine-tune a multi-lingually pretrained speech model on Common Voice | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multi_lingual_speech_recognition.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/multi_lingual_speech_recognition.ipynb)| | [How to fine-tune a model on audio classification](https://github.com/huggingface/notebooks/blob/main/examples/audio_classification.ipynb)| Show how to preprocess the data and fine-tune a pretrained Speech model on Keyword Spotting | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb)| #### Biological Sequences[[pytorch-bio]] | Notebook | Description | | | |:----------|:----------------------------------------------------------------------------------------|:-------------|------:| | [How to fine-tune a pre-trained protein model](https://github.com/huggingface/notebooks/blob/main/examples/protein_language_modeling.ipynb) | See how to tokenize proteins and fine-tune a large pre-trained protein "language" model | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/protein_language_modeling.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/protein_language_modeling.ipynb) | | [How to generate protein folds](https://github.com/huggingface/notebooks/blob/main/examples/protein_folding.ipynb) | See how to go from protein sequence to a full protein model and PDB file | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/protein_folding.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/protein_folding.ipynb) | | [How to fine-tune a Nucleotide Transformer model](https://github.com/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb) | See how to tokenize DNA and fine-tune a large pre-trained DNA "language" model | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb) | | [Fine-tune a Nucleotide Transformer model with LoRA](https://github.com/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling_with_peft.ipynb) | Train even larger DNA models in a memory-efficient way | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling_with_peft.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling_with_peft.ipynb) | #### Other modalities[[pytorch-other]] | Notebook | Description | | | |:----------|:----------------------------------------------------------------------------------------|:-------------|------:| | [Probabilistic Time Series Forecasting](https://github.com/huggingface/notebooks/blob/main/examples/time-series-transformers.ipynb) | See how to train Time Series Transformer on a custom dataset | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/time-series-transformers.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/time-series-transformers.ipynb) | #### Utility notebooks[[pytorch-utility]] | Notebook | Description | | | |:----------|:-------------|:-------------|------:| | [How to export model to ONNX](https://github.com/huggingface/notebooks/blob/main/examples/onnx-export.ipynb)| Highlight how to export and run inference workloads through ONNX | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/onnx-export.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/onnx-export.ipynb)| | [How to use Benchmarks](https://github.com/huggingface/notebooks/blob/main/examples/benchmark.ipynb)| How to benchmark models with transformers | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/benchmark.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/benchmark.ipynb)| ### TensorFlow Examples #### Natural Language Processing[[tensorflow-nlp]] | Notebook | Description | | | |:----------|:-------------|:-------------|------:| | [Train your tokenizer](https://github.com/huggingface/notebooks/blob/main/examples/tokenizer_training.ipynb) | How to train and use your very own tokenizer |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tokenizer_training.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/tokenizer_training.ipynb)| | [Train your language model](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch-tf.ipynb) | How to easily start using transformers |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch-tf.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch-tf.ipynb)| | [How to fine-tune a model on text classification](https://github.com/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on any GLUE task. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb)| | [How to fine-tune a model on language modeling](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on a causal or masked LM task. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)| | [How to fine-tune a model on token classification](https://github.com/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on a token classification task (NER, PoS). | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb)| | [How to fine-tune a model on question answering](https://github.com/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on SQUAD. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb)| | [How to fine-tune a model on multiple choice](https://github.com/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on SWAG. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb)| | [How to fine-tune a model on translation](https://github.com/huggingface/notebooks/blob/main/examples/translation-tf.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on WMT. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb)| | [How to fine-tune a model on summarization](https://github.com/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on XSUM. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb)| #### Computer Vision[[tensorflow-cv]] | Notebook | Description | | | |:---------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------|:-------------|------:| | [How to fine-tune a model on image classification](https://github.com/huggingface/notebooks/blob/main/examples/image_classification-tf.ipynb) | Show how to preprocess the data and fine-tune any pretrained Vision model on Image Classification | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification-tf.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/image_classification-tf.ipynb)| | [How to fine-tune a SegFormer model on semantic segmentation](https://github.com/huggingface/notebooks/blob/main/examples/semantic_segmentation-tf.ipynb) | Show how to preprocess the data and fine-tune a pretrained SegFormer model on Semantic Segmentation | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/semantic_segmentation-tf.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/semantic_segmentation-tf.ipynb)| #### Biological Sequences[[tensorflow-bio]] | Notebook | Description | | | |:----------|:-------------|:-------------|------:| | [How to fine-tune a pre-trained protein model](https://github.com/huggingface/notebooks/blob/main/examples/protein_language_modeling-tf.ipynb) | See how to tokenize proteins and fine-tune a large pre-trained protein "language" model | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/protein_language_modeling-tf.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/protein_language_modeling-tf.ipynb) | #### Utility notebooks[[tensorflow-utility]] | Notebook | Description | | | |:----------|:-------------|:-------------|------:| | [How to train TF/Keras models on TPU](https://github.com/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb) | See how to train at high speed on Google's TPU hardware | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb) | ### Optimum notebooks ๐Ÿค— [Optimum](https://github.com/huggingface/optimum) is an extension of ๐Ÿค— Transformers, providing a set of performance optimization tools enabling maximum efficiency to train and run models on targeted hardwares. | Notebook | Description | | | |:----------|:-------------|:-------------|------:| | [How to quantize a model with ONNX Runtime for text classification](https://github.com/huggingface/notebooks/blob/main/examples/text_classification_quantization_ort.ipynb)| Show how to apply static and dynamic quantization on a model using [ONNX Runtime](https://github.com/microsoft/onnxruntime) for any GLUE task. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification_quantization_ort.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/text_classification_quantization_ort.ipynb)| | [How to quantize a model with Intel Neural Compressor for text classification](https://github.com/huggingface/notebooks/blob/main/examples/text_classification_quantization_inc.ipynb)| Show how to apply static, dynamic and aware training quantization on a model using [Intel Neural Compressor (INC)](https://github.com/intel/neural-compressor) for any GLUE task. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification_quantization_inc.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/text_classification_quantization_inc.ipynb)| | [How to fine-tune a model on text classification with ONNX Runtime](https://github.com/huggingface/notebooks/blob/main/examples/text_classification_ort.ipynb)| Show how to preprocess the data and fine-tune a model on any GLUE task using [ONNX Runtime](https://github.com/microsoft/onnxruntime). | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification_ort.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/text_classification_ort.ipynb)| | [How to fine-tune a model on summarization with ONNX Runtime](https://github.com/huggingface/notebooks/blob/main/examples/summarization_ort.ipynb)| Show how to preprocess the data and fine-tune a model on XSUM using [ONNX Runtime](https://github.com/microsoft/onnxruntime). | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization_ort.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/summarization_ort.ipynb)| ## Community notebooks: More notebooks developed by the community are available [here](https://hf.co/docs/transformers/community#community-notebooks). # Methods and tools for efficient training on a single GPU This guide demonstrates practical techniques that you can use to increase the efficiency of your model's training by optimizing memory utilization, speeding up the training, or both. If you'd like to understand how GPU is utilized during training, please refer to the [Model training anatomy](model_memory_anatomy) conceptual guide first. This guide focuses on practical techniques. If you have access to a machine with multiple GPUs, these approaches are still valid, plus you can leverage additional methods outlined in the [multi-GPU section](perf_train_gpu_many). When training large models, there are two aspects that should be considered at the same time: * Data throughput/training time * Model performance Maximizing the throughput (samples/second) leads to lower training cost. This is generally achieved by utilizing the GPU as much as possible and thus filling GPU memory to its limit. If the desired batch size exceeds the limits of the GPU memory, the memory optimization techniques, such as gradient accumulation, can help. However, if the preferred batch size fits into memory, there's no reason to apply memory-optimizing techniques because they can slow down the training. Just because one can use a large batch size, does not necessarily mean they should. As part of hyperparameter tuning, you should determine which batch size yields the best results and then optimize resources accordingly. The methods and tools covered in this guide can be classified based on the effect they have on the training process: | Method/tool | Improves training speed | Optimizes memory utilization | |:--------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------|:-----------------------------| | [Batch size choice](#batch-size-choice) | Yes | Yes | | [Gradient accumulation](#gradient-accumulation) | No | Yes | | [Gradient checkpointing](#gradient-checkpointing) | No | Yes | | [Mixed precision training](#mixed-precision-training) | Yes | Maybe* | | [torch_empty_cache_steps](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments.torch_empty_cache_steps) | No | Yes | | [Optimizer choice](#optimizer-choice) | Yes | Yes | | [Data preloading](#data-preloading) | Yes | No | | [DeepSpeed Zero](#deepspeed-zero) | No | Yes | | [torch.compile](#using-torchcompile) | Yes | No | | [Parameter-Efficient Fine Tuning (PEFT)](#using--peft) | No | Yes | *Note: when using mixed precision with a small model and a large batch size, there will be some memory savings but with a large model and a small batch size, the memory use will be larger. You can combine the above methods to get a cumulative effect. These techniques are available to you whether you are training your model with `Trainer` or writing a pure PyTorch loop, in which case you can [configure these optimizations with ๐Ÿค— Accelerate](#using--accelerate). If these methods do not result in sufficient gains, you can explore the following options: * [Look into building your own custom Docker container with efficient software prebuilds](#efficient-software-prebuilds) * [Consider a model that uses Mixture of Experts (MoE)](#mixture-of-experts) * [Convert your model to BetterTransformer to leverage PyTorch native attention](#using-pytorch-native-attention-and-flash-attention) Finally, if all of the above is still not enough, even after switching to a server-grade GPU like A100, consider moving to a multi-GPU setup. All these approaches are still valid in a multi-GPU setup, plus you can leverage additional parallelism techniques outlined in the [multi-GPU section](perf_train_gpu_many). ## Batch size choice To achieve optimal performance, start by identifying the appropriate batch size. It is recommended to use batch sizes and input/output neuron counts that are of size 2^N. Often it's a multiple of 8, but it can be higher depending on the hardware being used and the model's dtype. For reference, check out NVIDIA's recommendation for [input/output neuron counts]( https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html#input-features) and [batch size](https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html#batch-size) for fully connected layers (which are involved in GEMMs (General Matrix Multiplications)). [Tensor Core Requirements](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc) define the multiplier based on the dtype and the hardware. For instance, for fp16 data type a multiple of 8 is recommended, unless it's an A100 GPU, in which case use multiples of 64. For parameters that are small, consider also [Dimension Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#dim-quantization). This is where tiling happens and the right multiplier can have a significant speedup. ## Gradient Accumulation The **gradient accumulation** method aims to calculate gradients in smaller increments instead of computing them for the entire batch at once. This approach involves iteratively calculating gradients in smaller batches by performing forward and backward passes through the model and accumulating the gradients during the process. Once a sufficient number of gradients have been accumulated, the model's optimization step is executed. By employing gradient accumulation, it becomes possible to increase the **effective batch size** beyond the limitations imposed by the GPU's memory capacity. However, it is important to note that the additional forward and backward passes introduced by gradient accumulation can slow down the training process. You can enable gradient accumulation by adding the `gradient_accumulation_steps` argument to `TrainingArguments`: ```py training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, **default_args) ``` In the above example, your effective batch size becomes 4. Alternatively, use ๐Ÿค— Accelerate to gain full control over the training loop. Find the ๐Ÿค— Accelerate example [further down in this guide](#using--accelerate). While it is advised to max out GPU usage as much as possible, a high number of gradient accumulation steps can result in a more pronounced training slowdown. Consider the following example. Let's say, the `per_device_train_batch_size=4` without gradient accumulation hits the GPU's limit. If you would like to train with batches of size 64, do not set the `per_device_train_batch_size` to 1 and `gradient_accumulation_steps` to 64. Instead, keep `per_device_train_batch_size=4` and set `gradient_accumulation_steps=16`. This results in the same effective batch size while making better use of the available GPU resources. For additional information, please refer to batch size and gradient accumulation benchmarks for [RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecomment-1004392537) and [A100](https://github.com/huggingface/transformers/issues/15026#issuecomment-1005033957). ## Gradient Checkpointing Some large models may still face memory issues even when the batch size is set to 1 and gradient accumulation is used. This is because there are other components that also require memory storage. Saving all activations from the forward pass in order to compute the gradients during the backward pass can result in significant memory overhead. The alternative approach of discarding the activations and recalculating them when needed during the backward pass, would introduce a considerable computational overhead and slow down the training process. **Gradient checkpointing** offers a compromise between these two approaches and saves strategically selected activations throughout the computational graph so only a fraction of the activations need to be re-computed for the gradients. For an in-depth explanation of gradient checkpointing, refer to [this great article](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9). To enable gradient checkpointing in the `Trainer`, pass the corresponding a flag to `TrainingArguments`: ```py training_args = TrainingArguments( per_device_train_batch_size=1, gradient_accumulation_steps=4, gradient_checkpointing=True, **default_args ) ``` Alternatively, use ๐Ÿค— Accelerate - find the ๐Ÿค— Accelerate example [further in this guide](#using--accelerate). While gradient checkpointing may improve memory efficiency, it slows training by approximately 20%. ## Mixed precision training **Mixed precision training** is a technique that aims to optimize the computational efficiency of training models by utilizing lower-precision numerical formats for certain variables. Traditionally, most models use 32-bit floating point precision (fp32 or float32) to represent and process variables. However, not all variables require this high precision level to achieve accurate results. By reducing the precision of certain variables to lower numerical formats like 16-bit floating point (fp16 or float16), we can speed up the computations. Because in this approach some computations are performed in half-precision, while some are still in full precision, the approach is called mixed precision training. Most commonly mixed precision training is achieved by using fp16 (float16) data types, however, some GPU architectures (such as the Ampere architecture) offer bf16 and tf32 (CUDA internal data type) data types. Check out the [NVIDIA Blog](https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/) to learn more about the differences between these data types. ### fp16 The main advantage of mixed precision training comes from saving the activations in half precision (fp16). Although the gradients are also computed in half precision they are converted back to full precision for the optimization step so no memory is saved here. While mixed precision training results in faster computations, it can also lead to more GPU memory being utilized, especially for small batch sizes. This is because the model is now present on the GPU in both 16-bit and 32-bit precision (1.5x the original model on the GPU). To enable mixed precision training, set the `fp16` flag to `True`: ```py training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args) ``` If you prefer to use ๐Ÿค— Accelerate, find the ๐Ÿค— Accelerate example [further in this guide](#using--accelerate). ### BF16 If you have access to an Ampere or newer hardware you can use bf16 for mixed precision training and evaluation. While bf16 has a worse precision than fp16, it has a much bigger dynamic range. In fp16 the biggest number you can have is `65504` and any number above that will result in an overflow. A bf16 number can be as large as `3.39e+38` (!) which is about the same as fp32 - because both have 8-bits used for the numerical range. You can enable BF16 in the ๐Ÿค— Trainer with: ```python training_args = TrainingArguments(bf16=True, **default_args) ``` ### TF32 The Ampere hardware uses a magical data type called tf32. It has the same numerical range as fp32 (8-bits), but instead of 23 bits precision it has only 10 bits (same as fp16) and uses only 19 bits in total. It's "magical" in the sense that you can use the normal fp32 training and/or inference code and by enabling tf32 support you can get up to 3x throughput improvement. All you need to do is to add the following to your code: ```python import torch torch.backends.cuda.matmul.allow_tf32 = True torch.backends.cudnn.allow_tf32 = True ``` CUDA will automatically switch to using tf32 instead of fp32 where possible, assuming that the used GPU is from the Ampere series. According to [NVIDIA research](https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/), the majority of machine learning training workloads show the same perplexity and convergence with tf32 training as with fp32. If you're already using fp16 or bf16 mixed precision it may help with the throughput as well. You can enable this mode in the ๐Ÿค— Trainer: ```python TrainingArguments(tf32=True, **default_args) ``` tf32 can't be accessed directly via `tensor.to(dtype=torch.tf32)` because it is an internal CUDA data type. You need `torch>=1.7` to use tf32 data types. For additional information on tf32 vs other precisions, please refer to the following benchmarks: [RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecomment-1004390803) and [A100](https://github.com/huggingface/transformers/issues/15026#issuecomment-1004543189). ## Flash Attention 2 You can speedup the training throughput by using Flash Attention 2 integration in transformers. Check out the appropriate section in the [single GPU section](./perf_infer_gpu_one#Flash-Attention-2) to learn more about how to load a model with Flash Attention 2 modules. ## Optimizer choice The most common optimizer used to train transformer models is Adam or AdamW (Adam with weight decay). Adam achieves good convergence by storing the rolling average of the previous gradients; however, it adds an additional memory footprint of the order of the number of model parameters. To remedy this, you can use an alternative optimizer. For example if you have [NVIDIA/apex](https://github.com/NVIDIA/apex) installed for NVIDIA GPUs, or [ROCmSoftwarePlatform/apex](https://github.com/ROCmSoftwarePlatform/apex) for AMD GPUs, `adamw_apex_fused` will give you the fastest training experience among all supported AdamW optimizers. `Trainer` integrates a variety of optimizers that can be used out of box: `adamw_hf`, `adamw_torch`, `adamw_torch_fused`, `adamw_apex_fused`, `adamw_anyprecision`, `adafactor`, or `adamw_bnb_8bit`. More optimizers can be plugged in via a third-party implementation. Let's take a closer look at two alternatives to AdamW optimizer: 1. `adafactor` which is available in `Trainer` 2. `adamw_bnb_8bit` is also available in Trainer, but a third-party integration is provided below for demonstration. For comparison, for a 3B-parameter model, like โ€œgoogle-t5/t5-3bโ€: * A standard AdamW optimizer will need 24GB of GPU memory because it uses 8 bytes for each parameter (8*3 => 24GB) * Adafactor optimizer will need more than 12GB. It uses slightly more than 4 bytes for each parameter, so 4*3 and then some extra. * 8bit BNB quantized optimizer will use only (2*3) 6GB if all optimizer states are quantized. ### Adafactor Adafactor doesn't store rolling averages for each element in weight matrices. Instead, it keeps aggregated information (sums of rolling averages row- and column-wise), significantly reducing its footprint. However, compared to Adam, Adafactor may have slower convergence in certain cases. You can switch to Adafactor by setting `optim="adafactor"` in `TrainingArguments`: ```py training_args = TrainingArguments(per_device_train_batch_size=4, optim="adafactor", **default_args) ``` Combined with other approaches (gradient accumulation, gradient checkpointing, and mixed precision training) you can notice up to 3x improvement while maintaining the throughput! However, as mentioned before, the convergence of Adafactor can be worse than Adam. ### 8-bit Adam Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it. Quantization means that it stores the state with lower precision and dequantizes it only for the optimization. This is similar to the idea behind mixed precision training. To use `adamw_bnb_8bit`, you simply need to set `optim="adamw_bnb_8bit"` in `TrainingArguments`: ```py training_args = TrainingArguments(per_device_train_batch_size=4, optim="adamw_bnb_8bit", **default_args) ``` However, we can also use a third-party implementation of the 8-bit optimizer for demonstration purposes to see how that can be integrated. First, follow the installation guide in the GitHub [repo](https://github.com/bitsandbytes-foundation/bitsandbytes) to install the `bitsandbytes` library that implements the 8-bit Adam optimizer. Next you need to initialize the optimizer. This involves two steps: * First, group the model's parameters into two groups - one where weight decay should be applied, and the other one where it should not. Usually, biases and layer norm parameters are not weight decayed. * Then do some argument housekeeping to use the same parameters as the previously used AdamW optimizer. ```py import bitsandbytes as bnb from torch import nn from transformers.trainer_pt_utils import get_parameter_names training_args = TrainingArguments(per_device_train_batch_size=4, **default_args) decay_parameters = get_parameter_names(model, [nn.LayerNorm]) decay_parameters = [name for name in decay_parameters if "bias" not in name] optimizer_grouped_parameters = [ { "params": [p for n, p in model.named_parameters() if n in decay_parameters], "weight_decay": training_args.weight_decay, }, { "params": [p for n, p in model.named_parameters() if n not in decay_parameters], "weight_decay": 0.0, }, ] optimizer_kwargs = { "betas": (training_args.adam_beta1, training_args.adam_beta2), "eps": training_args.adam_epsilon, } optimizer_kwargs["lr"] = training_args.learning_rate adam_bnb_optim = bnb.optim.Adam8bit( optimizer_grouped_parameters, betas=(training_args.adam_beta1, training_args.adam_beta2), eps=training_args.adam_epsilon, lr=training_args.learning_rate, ) ``` Finally, pass the custom optimizer as an argument to the `Trainer`: ```py trainer = Trainer(model=model, args=training_args, train_dataset=ds, optimizers=(adam_bnb_optim, None)) ``` Combined with other approaches (gradient accumulation, gradient checkpointing, and mixed precision training), you can expect to get about a 3x memory improvement and even slightly higher throughput as using Adafactor. ### multi_tensor pytorch-nightly introduced `torch.optim._multi_tensor` which should significantly speed up the optimizers for situations with lots of small feature tensors. It should eventually become the default, but if you want to experiment with it sooner, take a look at this GitHub [issue](https://github.com/huggingface/transformers/issues/9965). ## Data preloading One of the important requirements to reach great training speed is the ability to feed the GPU at the maximum speed it can handle. By default, everything happens in the main process, and it might not be able to read the data from disk fast enough, and thus create a bottleneck, leading to GPU under-utilization. Configure the following arguments to reduce the bottleneck: - `DataLoader(pin_memory=True, ...)` - ensures the data gets preloaded into the pinned memory on CPU and typically leads to much faster transfers from CPU to GPU memory. - `DataLoader(num_workers=4, ...)` - spawn several workers to preload data faster. During training, watch the GPU utilization stats; if it's far from 100%, experiment with increasing the number of workers. Of course, the problem could be elsewhere, so many workers won't necessarily lead to better performance. When using `Trainer`, the corresponding `TrainingArguments` are: `dataloader_pin_memory` (`True` by default), and `dataloader_num_workers` (defaults to `0`). ## DeepSpeed ZeRO DeepSpeed is an open-source deep learning optimization library that is integrated with ๐Ÿค— Transformers and ๐Ÿค— Accelerate. It provides a wide range of features and optimizations designed to improve the efficiency and scalability of large-scale deep learning training. If your model fits onto a single GPU and you have enough space to fit a small batch size, you don't need to use DeepSpeed as it'll only slow things down. However, if the model doesn't fit onto a single GPU or you can't fit a small batch, you can leverage DeepSpeed ZeRO + CPU Offload, or NVMe Offload for much larger models. In this case, you need to separately [install the library](main_classes/deepspeed#installation), then follow one of the guides to create a configuration file and launch DeepSpeed: * For an in-depth guide on DeepSpeed integration with `Trainer`, review [the corresponding documentation](main_classes/deepspeed), specifically the [section for a single GPU](main_classes/deepspeed#deployment-with-one-gpu). Some adjustments are required to use DeepSpeed in a notebook; please take a look at the [corresponding guide](main_classes/deepspeed#deployment-in-notebooks). * If you prefer to use ๐Ÿค— Accelerate, refer to [๐Ÿค— Accelerate DeepSpeed guide](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed). ## Using torch.compile PyTorch 2.0 introduced a new compile function that doesn't require any modification to existing PyTorch code but can optimize your code by adding a single line of code: `model = torch.compile(model)`. If using `Trainer`, you only need `to` pass the `torch_compile` option in the `TrainingArguments`: ```python training_args = TrainingArguments(torch_compile=True, **default_args) ``` `torch.compile` uses Python's frame evaluation API to automatically create a graph from existing PyTorch programs. After capturing the graph, different backends can be deployed to lower the graph to an optimized engine. You can find more details and benchmarks in [PyTorch documentation](https://pytorch.org/get-started/pytorch-2.0/). `torch.compile` has a growing list of backends, which can be found in by calling `torchdynamo.list_backends()`, each of which with its optional dependencies. Choose which backend to use by specifying it via `torch_compile_backend` in the `TrainingArguments`. Some of the most commonly used backends are: **Debugging backends**: * `dynamo.optimize("eager")` - Uses PyTorch to run the extracted GraphModule. This is quite useful in debugging TorchDynamo issues. * `dynamo.optimize("aot_eager")` - Uses AotAutograd with no compiler, i.e, just using PyTorch eager for the AotAutograd's extracted forward and backward graphs. This is useful for debugging, and unlikely to give speedups. **Training & inference backends**: * `dynamo.optimize("inductor")` - Uses TorchInductor backend with AotAutograd and cudagraphs by leveraging codegened Triton kernels [Read more](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747) * `dynamo.optimize("nvfuser")` - nvFuser with TorchScript. [Read more](https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593) * `dynamo.optimize("aot_nvfuser")` - nvFuser with AotAutograd. [Read more](https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593) * `dynamo.optimize("aot_cudagraphs")` - cudagraphs with AotAutograd. [Read more](https://github.com/pytorch/torchdynamo/pull/757) **Inference-only backend**s: * `dynamo.optimize("ofi")` - Uses TorchScript optimize_for_inference. [Read more](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html) * `dynamo.optimize("fx2trt")` - Uses NVIDIA TensorRT for inference optimizations. [Read more](https://pytorch.org/TensorRT/tutorials/getting_started_with_fx_path.html) * `dynamo.optimize("onnxrt")` - Uses ONNXRT for inference on CPU/GPU. [Read more](https://onnxruntime.ai/) * `dynamo.optimize("ipex")` - Uses IPEX for inference on CPU. [Read more](https://github.com/intel/intel-extension-for-pytorch) For an example of using `torch.compile` with ๐Ÿค— Transformers, check out this [blog post on fine-tuning a BERT model for Text Classification using the newest PyTorch 2.0 features](https://www.philschmid.de/getting-started-pytorch-2-0-transformers) ## Using ๐Ÿค— PEFT [Parameter-Efficient Fine Tuning (PEFT)](https://huggingface.co/blog/peft) methods freeze the pretrained model parameters during fine-tuning and add a small number of trainable parameters (the adapters) on top of it. As a result the [memory associated to the optimizer states and gradients](https://huggingface.co/docs/transformers/model_memory_anatomy#anatomy-of-models-memory) are greatly reduced. For example with a vanilla AdamW, the memory requirement for the optimizer state would be: * fp32 copy of parameters: 4 bytes/param * Momentum: 4 bytes/param * Variance: 4 bytes/param Suppose a model with 7B parameters and 200 million parameters injected with [Low Rank Adapters](https://huggingface.co/docs/peft/conceptual_guides/lora). The memory requirement for the optimizer state of the plain model would be 12 * 7 = 84 GB (assuming 7B trainable parameters). Adding Lora increases slightly the memory associated to the model weights and substantially decreases memory requirement for the optimizer state to 12 * 0.2 = 2.4GB. Read more about PEFT and its detailed usage in [the PEFT documentation](https://huggingface.co/docs/peft/) or [PEFT repository](https://github.com/huggingface/peft). ## Using ๐Ÿค— Accelerate With [๐Ÿค— Accelerate](https://huggingface.co/docs/accelerate/index) you can use the above methods while gaining full control over the training loop and can essentially write the loop in pure PyTorch with some minor modifications. Suppose you have combined the methods in the `TrainingArguments` like so: ```py training_args = TrainingArguments( per_device_train_batch_size=1, gradient_accumulation_steps=4, gradient_checkpointing=True, fp16=True, **default_args, ) ``` The full example training loop with ๐Ÿค— Accelerate is only a handful of lines of code long: ```py from accelerate import Accelerator from torch.utils.data.dataloader import DataLoader dataloader = DataLoader(ds, batch_size=training_args.per_device_train_batch_size) if training_args.gradient_checkpointing: model.gradient_checkpointing_enable() accelerator = Accelerator(fp16=training_args.fp16) model, optimizer, dataloader = accelerator.prepare(model, adam_bnb_optim, dataloader) model.train() for step, batch in enumerate(dataloader, start=1): loss = model(**batch).loss loss = loss / training_args.gradient_accumulation_steps accelerator.backward(loss) if step % training_args.gradient_accumulation_steps == 0: optimizer.step() optimizer.zero_grad() ``` First we wrap the dataset in a [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader). Then we can enable gradient checkpointing by calling the model's `gradient_checkpointing_enable()` method. When we initialize the [`Accelerator`](https://huggingface.co/docs/accelerate/package_reference/accelerator#accelerate.Accelerator) we can specify if we want to use mixed precision training and it will take care of it for us in the `prepare` call. During the [`prepare`](https://huggingface.co/docs/accelerate/package_reference/accelerator#accelerate.Accelerator.prepare) call the dataloader will also be distributed across workers should we use multiple GPUs. We use the same [8-bit optimizer](#8-bit-adam) from the earlier example. Finally, we can add the main training loop. Note that the `backward` call is handled by ๐Ÿค— Accelerate. We can also see how gradient accumulation works: we normalize the loss, so we get the average at the end of accumulation and once we have enough steps we run the optimization. Implementing these optimization techniques with ๐Ÿค— Accelerate only takes a handful of lines of code and comes with the benefit of more flexibility in the training loop. For a full documentation of all features have a look at the [Accelerate documentation](https://huggingface.co/docs/accelerate/index). ## Efficient Software Prebuilds PyTorch's [pip and conda builds](https://pytorch.org/get-started/locally/#start-locally) come prebuilt with the cuda toolkit which is enough to run PyTorch, but it is insufficient if you need to build cuda extensions. At times, additional efforts may be required to pre-build some components. For instance, if you're using libraries like `apex` that don't come pre-compiled. In other situations figuring out how to install the right cuda toolkit system-wide can be complicated. To address these scenarios PyTorch and NVIDIA released a new version of NGC docker container which already comes with everything prebuilt. You just need to install your programs on it, and it will run out of the box. This approach is also useful if you want to tweak the pytorch source and/or make a new customized build. To find the docker image version you want start [with PyTorch release notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/), choose one of the latest monthly releases. Go into the release's notes for the desired release, check that the environment's components are matching your needs (including NVIDIA Driver requirements!) and then at the very top of that document go to the corresponding NGC page. If for some reason you get lost, here is [the index of all PyTorch NGC images](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch). Next follow the instructions to download and deploy the docker image. ## Mixture of Experts Some recent papers reported a 4-5x training speedup and a faster inference by integrating Mixture of Experts (MoE) into the Transformer models. Since it has been discovered that more parameters lead to better performance, this technique allows to increase the number of parameters by an order of magnitude without increasing training costs. In this approach every other FFN layer is replaced with a MoE Layer which consists of many experts, with a gated function that trains each expert in a balanced way depending on the input token's position in a sequence. ![MoE Transformer 2x block](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perf-moe-transformer.png) (source: [GLAM](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html)) You can find exhaustive details and comparison tables in the papers listed at the end of this section. The main drawback of this approach is that it requires staggering amounts of GPU memory - almost an order of magnitude larger than its dense equivalent. Various distillation and approaches are proposed to how to overcome the much higher memory requirements. There is direct trade-off though, you can use just a few experts with a 2-3x smaller base model instead of dozens or hundreds experts leading to a 5x smaller model and thus increase the training speed moderately while increasing the memory requirements moderately as well. Most related papers and implementations are built around Tensorflow/TPUs: - [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://arxiv.org/abs/2006.16668) - [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) - [GLaM: Generalist Language Model (GLaM)](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html) And for Pytorch DeepSpeed has built one as well: [DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://arxiv.org/abs/2201.05596), [Mixture of Experts](https://www.deepspeed.ai/tutorials/mixture-of-experts/) - blog posts: [1](https://www.microsoft.com/en-us/research/blog/deepspeed-powers-8x-larger-moe-model-training-with-high-performance/), [2](https://www.microsoft.com/en-us/research/publication/scalable-and-efficient-moe-training-for-multitask-multilingual-models/) and specific deployment with large transformer-based natural language generation models: [blog post](https://www.deepspeed.ai/2021/12/09/deepspeed-moe-nlg.html), [Megatron-Deepspeed branch](https://github.com/microsoft/Megatron-DeepSpeed/tree/moe-training). ## Using PyTorch native attention and Flash Attention PyTorch's [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA) can also call FlashAttention and memory-efficient attention kernels under the hood. SDPA support is currently being added natively in Transformers and is used by default for `torch>=2.1.1` when an implementation is available. Please refer to [PyTorch scaled dot product attention](https://huggingface.co/docs/transformers/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) for a list of supported models and more details. Check out this [blogpost](https://pytorch.org/blog/out-of-the-box-acceleration/) to learn more about acceleration and memory-savings with SDPA. # How to Hack Any Transformers Model The [๐Ÿค— Transformers](https://github.com/huggingface/transformers) library offers a collection of pre-trained models and tools for natural language processing, vision, and beyond. While these models cover a wide range of applications, you might encounter use cases that aren't supported out of the box. Customizing models can unlock new possibilities, such as adding new layers, altering architectures, or optimizing attention mechanisms. This guide will show you how to modify existing Transformers models to fit your specific needs. The great thing is, you donโ€™t have to step away from the Transformers framework to make these changes. You can actually modify models directly in Transformers and still take advantage of features like the [Trainer API](https://huggingface.co/docs/transformers/main/en/main_classes/trainer), [PreTrainedModel](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel), and efficient fine-tuning with tools like [PEFT](https://huggingface.co/docs/peft/index). In this guide, weโ€™ll walk you through how to customize existing Transformers models to meet your requirementsโ€”without losing the benefits of the ecosystem. You'll learn how to: - Modify a model's architecture by changing its attention mechanism. - Apply techniques like Low-Rank Adaptation (LoRA) to specific model components. We encourage you to contribute your own hacks and share them here with the community1 ## Example: Modifying the Attention Mechanism in the Segment Anything Model (SAM) The **Segment Anything Model (SAM)** is a state-of-the-art model for image segmentation. In its default implementation, SAM uses a combined query-key-value (`qkv`) projection in its attention mechanism. However, you might want to fine-tune only specific components of the attention mechanism, such as the query (`q`) and value (`v`) projections, to reduce the number of trainable parameters and computational resources required. ### Motivation By splitting the combined `qkv` projection into separate `q`, `k`, and `v` projections, you can apply techniques like **LoRA** (Low-Rank Adaptation) to only the `q` and `v` projections. This approach allows you to: - Fine-tune fewer parameters, reducing computational overhead. - Potentially achieve better performance by focusing on specific components. - Experiment with different adaptation strategies in the attention mechanism. ### Implementation #### **Step 1: Create a Custom Attention Class** Next, subclass the original `SamVisionAttention` class and modify it to have separate `q`, `k`, and `v` projections. ```python import torch import torch.nn as nn from transformers.models.sam.modeling_sam import SamVisionAttention class SamVisionAttentionSplit(SamVisionAttention, nn.Module): def __init__(self, config, window_size): super().__init__(config, window_size) del self.qkv # Separate q, k, v projections self.q = nn.Linear(config.hidden_size, config.hidden_size, bias=config.qkv_bias) self.k = nn.Linear(config.hidden_size, config.hidden_size, bias=config.qkv_bias) self.v = nn.Linear(config.hidden_size, config.hidden_size, bias=config.qkv_bias) self._register_load_state_dict_pre_hook(self.split_q_k_v_load_hook) def split_q_k_v_load_hook(self, state_dict, prefix, *args): keys_to_delete = [] for key in list(state_dict.keys()): if "qkv." in key: # Split q, k, v from the combined projection q, k, v = state_dict[key].chunk(3, dim=0) # Replace with individual q, k, v projections state_dict[key.replace("qkv.", "q.")] = q state_dict[key.replace("qkv.", "k.")] = k state_dict[key.replace("qkv.", "v.")] = v # Mark the old qkv key for deletion keys_to_delete.append(key) # Remove old qkv keys for key in keys_to_delete: del state_dict[key] def forward(self, hidden_states: torch.Tensor, output_attentions=False) -> torch.Tensor: batch_size, height, width, _ = hidden_states.shape qkv_shapes = (batch_size * self.num_attention_heads, height * width, -1) query = self.q(hidden_states).reshape((batch_size, height * width,self.num_attention_heads, -1)).permute(0,2,1,3).reshape(qkv_shapes) key = self.k(hidden_states).reshape((batch_size, height * width,self.num_attention_heads, -1)).permute(0,2,1,3).reshape(qkv_shapes) value = self.v(hidden_states).reshape((batch_size, height * width,self.num_attention_heads, -1)).permute(0,2,1,3).reshape(qkv_shapes) attn_weights = (query * self.scale) @ key.transpose(-2, -1) if self.use_rel_pos: attn_weights = self.add_decomposed_rel_pos( attn_weights, query, self.rel_pos_h, self.rel_pos_w, (height, width), (height, width) ) attn_weights = torch.nn.functional.softmax(attn_weights, dtype=torch.float32, dim=-1).to(query.dtype) attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training) attn_output = (attn_probs @ value).reshape(batch_size, self.num_attention_heads, height, width, -1) attn_output = attn_output.permute(0, 2, 3, 1, 4).reshape(batch_size, height, width, -1) attn_output = self.proj(attn_output) if output_attentions: outputs = (attn_output, attn_weights) else: outputs = (attn_output, None) return outputs ``` **Explanation:** - **Separate Projections:** The combined `qkv` projection is removed, and separate `q`, `k`, and `v` linear layers are created. - **Weight Loading Hook:** The `_split_qkv_load_hook` method splits the pre-trained `qkv` weights into separate `q`, `k`, and `v` weights when loading the model. This ensures compatibility with any pre-trained model. - **Forward Pass:** Queries, keys, and values are computed separately, and the attention mechanism proceeds as usual. #### **Step 2: Replace the Original Attention Class** Replace the original `SamVisionAttention` class with your custom class so that the model uses the modified attention mechanism. ```python from transformers import SamModel from transformers.models.sam import modeling_sam # Replace the attention class in the modeling_sam module modeling_sam.SamVisionAttention = SamVisionAttentionSplit # Load the pre-trained SAM model model = SamModel.from_pretrained("facebook/sam-vit-base") ``` **Explanation:** - **Class Replacement:** By assigning your custom class to `modeling_sam.SamVisionAttention`, any instances of `SamVisionAttention` in the model will use the modified version. Thus when you call `SamModel`, it will use the newly defined `SamVisionAttentionSplit`. - **Model Loading:** The model is loaded using `from_pretrained`, and the custom attention mechanism is integrated. #### **Step 3: Apply LoRA to Specific Projections** With separate `q`, `k`, and `v` projections, you can now apply LoRA to specific components, such as the `q` and `v` projections. ```python from peft import LoraConfig, get_peft_model config = LoraConfig( r=16, lora_alpha=32, target_modules=["q", "v"], # Apply LoRA to q and v projections lora_dropout=0.1, task_type="mask-generation" ) # Apply LoRA to the model model = get_peft_model(model, config) ``` **Explanation:** - **LoRA Configuration:** The `LoraConfig` specifies the rank `r`, scaling factor `lora_alpha`, target modules (`"q"` and `"v"`), dropout, and task type. - **Applying LoRA:** The `get_peft_model` function applies LoRA to the specified modules in the model. - **Parameter Reduction:** By focusing on `q` and `v`, you reduce the number of trainable parameters, leading to faster training and lower memory usage. #### **Step 4: Verify the Number of Trainable Parameters** It's simple to verify the number of trainable parameters and see what impact your modification had. ```python model.print_trainable_parameters() ``` **Expected Output:** ``` trainable params: 608,256 || all params: 94,343,728 || trainable%: 0.6447 trainable params: 912,384 || all params: 94,647,856 || trainable%: 0.9640 # with k ``` ## Contributing Your Own Hacks Modifying pre-trained models can open up new avenues for research and application. By understanding and adjusting the internal mechanisms of models like SAM, you can tailor them to your specific needs, optimize performance, and experiment with new ideas. If you've developed your own hacks for Transformers models and would like to share them, consider contributing to this doc. - **Open a Pull Request:** Share your code changes and improvements directly in the repository. - **Write Documentation:** Provide clear explanations and examples of your modifications. - **Engage with the Community:** Discuss your ideas and get feedback from other developers and researchers by opening an issue. # Philosophy ๐Ÿค— Transformers is an opinionated library built for: - machine learning researchers and educators seeking to use, study or extend large-scale Transformers models. - hands-on practitioners who want to fine-tune those models or serve them in production, or both. - engineers who just want to download a pretrained model and use it to solve a given machine learning task. The library was designed with two strong goals in mind: 1. Be as easy and fast to use as possible: - We strongly limited the number of user-facing abstractions to learn, in fact, there are almost no abstractions, just three standard classes required to use each model: [configuration](main_classes/configuration), [models](main_classes/model), and a preprocessing class ([tokenizer](main_classes/tokenizer) for NLP, [image processor](main_classes/image_processor) for vision, [feature extractor](main_classes/feature_extractor) for audio, and [processor](main_classes/processors) for multimodal inputs). - All of these classes can be initialized in a simple and unified way from pretrained instances by using a common `from_pretrained()` method which downloads (if needed), caches and loads the related class instance and associated data (configurations' hyperparameters, tokenizers' vocabulary, and models' weights) from a pretrained checkpoint provided on [Hugging Face Hub](https://huggingface.co/models) or your own saved checkpoint. - On top of those three base classes, the library provides two APIs: `pipeline()` for quickly using a model for inference on a given task and `Trainer` to quickly train or fine-tune a PyTorch model (all TensorFlow models are compatible with `Keras.fit`). - As a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to extend or build upon the library, just use regular Python, PyTorch, TensorFlow, Keras modules and inherit from the base classes of the library to reuse functionalities like model loading and saving. If you'd like to learn more about our coding philosophy for models, check out our [Repeat Yourself](https://huggingface.co/blog/transformers-design-philosophy) blog post. 2. Provide state-of-the-art models with performances as close as possible to the original models: - We provide at least one example for each architecture which reproduces a result provided by the official authors of said architecture. - The code is usually as close to the original code base as possible which means some PyTorch code may be not as *pytorchic* as it could be as a result of being converted TensorFlow code and vice versa. A few other goals: - Expose the models' internals as consistently as possible: - We give access, using a single API, to the full hidden-states and attention weights. - The preprocessing classes and base model APIs are standardized to easily switch between models. - Incorporate a subjective selection of promising tools for fine-tuning and investigating these models: - A simple and consistent way to add new tokens to the vocabulary and embeddings for fine-tuning. - Simple ways to mask and prune Transformer heads. - Easily switch between PyTorch, TensorFlow 2.0 and Flax, allowing training with one framework and inference with another. ## Main concepts The library is built around three types of classes for each model: - **Model classes** can be PyTorch models ([torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)), Keras models ([tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model)) or JAX/Flax models ([flax.linen.Module](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html)) that work with the pretrained weights provided in the library. - **Configuration classes** store the hyperparameters required to build a model (such as the number of layers and hidden size). You don't always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model). - **Preprocessing classes** convert the raw data into a format accepted by the model. A [tokenizer](main_classes/tokenizer) stores the vocabulary for each model and provide methods for encoding and decoding strings in a list of token embedding indices to be fed to a model. [Image processors](main_classes/image_processor) preprocess vision inputs, [feature extractors](main_classes/feature_extractor) preprocess audio inputs, and a [processor](main_classes/processors) handles multimodal inputs. All these classes can be instantiated from pretrained instances, saved locally, and shared on the Hub with three methods: - `from_pretrained()` lets you instantiate a model, configuration, and preprocessing class from a pretrained version either provided by the library itself (the supported models can be found on the [Model Hub](https://huggingface.co/models)) or stored locally (or on a server) by the user. - `save_pretrained()` lets you save a model, configuration, and preprocessing class locally so that it can be reloaded using `from_pretrained()`. - `push_to_hub()` lets you share a model, configuration, and a preprocessing class to the Hub, so it is easily accessible to everyone. # Fine-tune a pretrained model There are significant benefits to using a pretrained model. It reduces computation costs, your carbon footprint, and allows you to use state-of-the-art models without having to train one from scratch. ๐Ÿค— Transformers provides access to thousands of pretrained models for a wide range of tasks. When you use a pretrained model, you train it on a dataset specific to your task. This is known as fine-tuning, an incredibly powerful training technique. In this tutorial, you will fine-tune a pretrained model with a deep learning framework of your choice: * Fine-tune a pretrained model with ๐Ÿค— Transformers `Trainer`. * Fine-tune a pretrained model in TensorFlow with Keras. * Fine-tune a pretrained model in native PyTorch. ## Prepare a dataset Before you can fine-tune a pretrained model, download a dataset and prepare it for training. The previous tutorial showed you how to process data for training, and now you get an opportunity to put those skills to the test! Begin by loading the [Yelp Reviews](https://huggingface.co/datasets/yelp_review_full) dataset: ```py >>> from datasets import load_dataset >>> dataset = load_dataset("yelp_review_full") >>> dataset["train"][100] {'label': 0, 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'} ``` As you now know, you need a tokenizer to process the text and include a padding and truncation strategy to handle any variable sequence lengths. To process your dataset in one step, use ๐Ÿค— Datasets [`map`](https://huggingface.co/docs/datasets/process#map) method to apply a preprocessing function over the entire dataset: ```py >>> from transformers import AutoTokenizer >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") >>> def tokenize_function(examples): ... return tokenizer(examples["text"], padding="max_length", truncation=True) >>> tokenized_datasets = dataset.map(tokenize_function, batched=True) ``` If you like, you can create a smaller subset of the full dataset to fine-tune on to reduce the time it takes: ```py >>> small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) >>> small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000)) ``` ## Train At this point, you should follow the section corresponding to the framework you want to use. You can use the links in the right sidebar to jump to the one you want - and if you want to hide all of the content for a given framework, just use the button at the top-right of that framework's block! ## Train with PyTorch Trainer ๐Ÿค— Transformers provides a `Trainer` class optimized for training ๐Ÿค— Transformers models, making it easier to start training without manually writing your own training loop. The `Trainer` API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision. Start by loading your model and specify the number of expected labels. From the Yelp Review [dataset card](https://huggingface.co/datasets/yelp_review_full#data-fields), you know there are five labels: ```py >>> from transformers import AutoModelForSequenceClassification >>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` You will see a warning about some of the pretrained weights not being used and some weights being randomly initialized. Don't worry, this is completely normal! The pretrained head of the BERT model is discarded, and replaced with a randomly initialized classification head. You will fine-tune this new model head on your sequence classification task, transferring the knowledge of the pretrained model to it. ### Training hyperparameters Next, create a `TrainingArguments` class which contains all the hyperparameters you can tune as well as flags for activating different training options. For this tutorial you can start with the default training [hyperparameters](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments), but feel free to experiment with these to find your optimal settings. Specify where to save the checkpoints from your training: ```py >>> from transformers import TrainingArguments >>> training_args = TrainingArguments(output_dir="test_trainer") ``` ### Evaluate `Trainer` does not automatically evaluate model performance during training. You'll need to pass `Trainer` a function to compute and report metrics. The [๐Ÿค— Evaluate](https://huggingface.co/docs/evaluate/index) library provides a simple [`accuracy`](https://huggingface.co/spaces/evaluate-metric/accuracy) function you can load with the [evaluate.load](https://huggingface.co/docs/evaluate/main/en/package_reference/loading_methods#evaluate.load) (see this [quicktour](https://huggingface.co/docs/evaluate/a_quick_tour) for more information) function: ```py >>> import numpy as np >>> import evaluate >>> metric = evaluate.load("accuracy") ``` Call `compute` on `metric` to calculate the accuracy of your predictions. Before passing your predictions to `compute`, you need to convert the logits to predictions (remember all ๐Ÿค— Transformers models return logits): ```py >>> def compute_metrics(eval_pred): ... logits, labels = eval_pred ... predictions = np.argmax(logits, axis=-1) ... return metric.compute(predictions=predictions, references=labels) ``` If you'd like to monitor your evaluation metrics during fine-tuning, specify the `eval_strategy` parameter in your training arguments to report the evaluation metric at the end of each epoch: ```py >>> from transformers import TrainingArguments, Trainer >>> training_args = TrainingArguments(output_dir="test_trainer", eval_strategy="epoch") ``` ### Trainer Create a `Trainer` object with your model, training arguments, training and test datasets, and evaluation function: ```py >>> trainer = Trainer( ... model=model, ... args=training_args, ... train_dataset=small_train_dataset, ... eval_dataset=small_eval_dataset, ... compute_metrics=compute_metrics, ... ) ``` Then fine-tune your model by calling `train()`: ```py >>> trainer.train() ``` ## Train in native PyTorch `Trainer` takes care of the training loop and allows you to fine-tune a model in a single line of code. For users who prefer to write their own training loop, you can also fine-tune a ๐Ÿค— Transformers model in native PyTorch. At this point, you may need to restart your notebook or execute the following code to free some memory: ```py del model del trainer torch.cuda.empty_cache() ``` Next, manually postprocess `tokenized_dataset` to prepare it for training. 1. Remove the `text` column because the model does not accept raw text as an input: ```py >>> tokenized_datasets = tokenized_datasets.remove_columns(["text"]) ``` 2. Rename the `label` column to `labels` because the model expects the argument to be named `labels`: ```py >>> tokenized_datasets = tokenized_datasets.rename_column("label", "labels") ``` 3. Set the format of the dataset to return PyTorch tensors instead of lists: ```py >>> tokenized_datasets.set_format("torch") ``` Then create a smaller subset of the dataset as previously shown to speed up the fine-tuning: ```py >>> small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) >>> small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000)) ``` ### DataLoader Create a `DataLoader` for your training and test datasets so you can iterate over batches of data: ```py >>> from torch.utils.data import DataLoader >>> train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8) >>> eval_dataloader = DataLoader(small_eval_dataset, batch_size=8) ``` Load your model with the number of expected labels: ```py >>> from transformers import AutoModelForSequenceClassification >>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` ### Optimizer and learning rate scheduler Create an optimizer and learning rate scheduler to fine-tune the model. Let's use the [`AdamW`](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) optimizer from PyTorch: ```py >>> from torch.optim import AdamW >>> optimizer = AdamW(model.parameters(), lr=5e-5) ``` Create the default learning rate scheduler from `Trainer`: ```py >>> from transformers import get_scheduler >>> num_epochs = 3 >>> num_training_steps = num_epochs * len(train_dataloader) >>> lr_scheduler = get_scheduler( ... name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps ... ) ``` Lastly, specify `device` to use a GPU if you have access to one. Otherwise, training on a CPU may take several hours instead of a couple of minutes. ```py >>> import torch >>> device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") >>> model.to(device) ``` Get free access to a cloud GPU if you don't have one with a hosted notebook like [Colaboratory](https://colab.research.google.com/) or [SageMaker StudioLab](https://studiolab.sagemaker.aws/). Great, now you are ready to train! ๐Ÿฅณ ### Training loop To keep track of your training progress, use the [tqdm](https://tqdm.github.io/) library to add a progress bar over the number of training steps: ```py >>> from tqdm.auto import tqdm >>> progress_bar = tqdm(range(num_training_steps)) >>> model.train() >>> for epoch in range(num_epochs): ... for batch in train_dataloader: ... batch = {k: v.to(device) for k, v in batch.items()} ... outputs = model(**batch) ... loss = outputs.loss ... loss.backward() ... optimizer.step() ... lr_scheduler.step() ... optimizer.zero_grad() ... progress_bar.update(1) ``` ### Evaluate Just like how you added an evaluation function to `Trainer`, you need to do the same when you write your own training loop. But instead of calculating and reporting the metric at the end of each epoch, this time you'll accumulate all the batches with `add_batch` and calculate the metric at the very end. ```py >>> import evaluate >>> metric = evaluate.load("accuracy") >>> model.eval() >>> for batch in eval_dataloader: ... batch = {k: v.to(device) for k, v in batch.items()} ... with torch.no_grad(): ... outputs = model(**batch) ... logits = outputs.logits ... predictions = torch.argmax(logits, dim=-1) ... metric.add_batch(predictions=predictions, references=batch["labels"]) >>> metric.compute() ``` ## Additional resources For more fine-tuning examples, refer to: - [๐Ÿค— Transformers Examples](https://github.com/huggingface/transformers/tree/main/examples) includes scripts to train common NLP tasks in PyTorch and TensorFlow. - [๐Ÿค— Transformers Notebooks](notebooks) contains various notebooks on how to fine-tune a model for specific tasks in PyTorch and TensorFlow. # Chat Templates ## Introduction An increasingly common use case for LLMs is **chat**. In a chat context, rather than continuing a single string of text (as is the case with a standard language model), the model instead continues a conversation that consists of one or more **messages**, each of which includes a **role**, like "user" or "assistant", as well as message text. Much like tokenization, different models expect very different input formats for chat. This is the reason we added **chat templates** as a feature. Chat templates are part of the tokenizer. They specify how to convert conversations, represented as lists of messages, into a single tokenizable string in the format that the model expects. Let's make this concrete with a quick example using the `mistralai/Mistral-7B-Instruct-v0.1` model: ```python >>> from transformers import AutoTokenizer >>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1") >>> chat = [ ... {"role": "user", "content": "Hello, how are you?"}, ... {"role": "assistant", "content": "I'm doing great. How can I help you today?"}, ... {"role": "user", "content": "I'd like to show off how chat templating works!"}, ... ] >>> tokenizer.apply_chat_template(chat, tokenize=False) "[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today? [INST] I'd like to show off how chat templating works! [/INST]" ``` Notice how the tokenizer has added the control tokens [INST] and [/INST] to indicate the start and end of user messages (but not assistant messages!), and the entire chat is condensed into a single string. If we use `tokenize=True`, which is the default setting, that string will also be tokenized for us. Now, try the same code, but swap in the `HuggingFaceH4/zephyr-7b-beta` model instead, and you should get: ```text <|user|> Hello, how are you? <|assistant|> I'm doing great. How can I help you today? <|user|> I'd like to show off how chat templating works! ``` Both Zephyr and Mistral-Instruct were fine-tuned from the same base model, `Mistral-7B-v0.1`. However, they were trained with totally different chat formats. Without chat templates, you would have to write manual formatting code for each model, and it's very easy to make minor errors that hurt performance! Chat templates handle the details of formatting for you, allowing you to write universal code that works for any model. ## How do I use chat templates? As you can see in the example above, chat templates are easy to use. Simply build a list of messages, with `role` and `content` keys, and then pass it to the `apply_chat_template()` method. Once you do that, you'll get output that's ready to go! When using chat templates as input for model generation, it's also a good idea to use `add_generation_prompt=True` to add a [generation prompt](#what-are-generation-prompts). Here's an example of preparing input for `model.generate()`, using `Zephyr` again: ```python from transformers import AutoModelForCausalLM, AutoTokenizer checkpoint = "HuggingFaceH4/zephyr-7b-beta" tokenizer = AutoTokenizer.from_pretrained(checkpoint) model = AutoModelForCausalLM.from_pretrained(checkpoint) # You may want to use bfloat16 and/or move to GPU here messages = [ { "role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate", }, {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}, ] tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt") print(tokenizer.decode(tokenized_chat[0])) ``` This will yield a string in the input format that Zephyr expects. ```text <|system|> You are a friendly chatbot who always responds in the style of a pirate <|user|> How many helicopters can a human eat in one sitting? <|assistant|> ``` Now that our input is formatted correctly for Zephyr, we can use the model to generate a response to the user's question: ```python outputs = model.generate(tokenized_chat, max_new_tokens=128) print(tokenizer.decode(outputs[0])) ``` This will yield: ```text <|system|> You are a friendly chatbot who always responds in the style of a pirate <|user|> How many helicopters can a human eat in one sitting? <|assistant|> Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all. ``` Arr, 'twas easy after all! ## Is there an automated pipeline for chat? Yes, there is! Our text generation pipelines support chat inputs, which makes it easy to use chat models. In the past, we used to use a dedicated "ConversationalPipeline" class, but this has now been deprecated and its functionality has been merged into the `TextGenerationPipeline`. Let's try the `Zephyr` example again, but this time using a pipeline: ```python from transformers import pipeline pipe = pipeline("text-generation", "HuggingFaceH4/zephyr-7b-beta") messages = [ { "role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate", }, {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}, ] print(pipe(messages, max_new_tokens=128)[0]['generated_text'][-1]) # Print the assistant's response ``` ```text {'role': 'assistant', 'content': "Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all."} ``` The pipeline will take care of all the details of tokenization and calling `apply_chat_template` for you - once the model has a chat template, all you need to do is initialize the pipeline and pass it the list of messages! ## What are "generation prompts"? You may have noticed that the `apply_chat_template` method has an `add_generation_prompt` argument. This argument tells the template to add tokens that indicate the start of a bot response. For example, consider the following chat: ```python messages = [ {"role": "user", "content": "Hi there!"}, {"role": "assistant", "content": "Nice to meet you!"}, {"role": "user", "content": "Can I ask a question?"} ] ``` Here's what this will look like without a generation prompt, for a model that uses standard "ChatML" formatting: ```python tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False) """<|im_start|>user Hi there!<|im_end|> <|im_start|>assistant Nice to meet you!<|im_end|> <|im_start|>user Can I ask a question?<|im_end|> """ ``` And here's what it looks like **with** a generation prompt: ```python tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) """<|im_start|>user Hi there!<|im_end|> <|im_start|>assistant Nice to meet you!<|im_end|> <|im_start|>user Can I ask a question?<|im_end|> <|im_start|>assistant """ ``` Note that this time, we've added the tokens that indicate the start of a bot response. This ensures that when the model generates text it will write a bot response instead of doing something unexpected, like continuing the user's message. Remember, chat models are still just language models - they're trained to continue text, and chat is just a special kind of text to them! You need to guide them with appropriate control tokens, so they know what they're supposed to be doing. Not all models require generation prompts. Some models, like LLaMA, don't have any special tokens before bot responses. In these cases, the `add_generation_prompt` argument will have no effect. The exact effect that `add_generation_prompt` has will depend on the template being used. ## What does "continue_final_message" do? When passing a list of messages to `apply_chat_template` or `TextGenerationPipeline`, you can choose to format the chat so the model will continue the final message in the chat instead of starting a new one. This is done by removing any end-of-sequence tokens that indicate the end of the final message, so that the model will simply extend the final message when it begins to generate text. This is useful for "prefilling" the model's response. Here's an example: ```python chat = [ {"role": "user", "content": "Can you format the answer in JSON?"}, {"role": "assistant", "content": '{"name": "'}, ] formatted_chat = tokenizer.apply_chat_template(chat, tokenize=True, return_dict=True, continue_final_message=True) model.generate(**formatted_chat) ``` The model will generate text that continues the JSON string, rather than starting a new message. This approach can be very useful for improving the accuracy of the model's instruction-following when you know how you want it to start its replies. Because `add_generation_prompt` adds the tokens that start a new message, and `continue_final_message` removes any end-of-message tokens from the final message, it does not make sense to use them together. As a result, you'll get an error if you try! The default behaviour of `TextGenerationPipeline` is to set `add_generation_prompt=True` so that it starts a new message. However, if the final message in the input chat has the "assistant" role, it will assume that this message is a prefill and switch to `continue_final_message=True` instead, because most models do not support multiple consecutive assistant messages. You can override this behaviour by explicitly passing the `continue_final_message` argument when calling the pipeline. ## Can I use chat templates in training? Yes! This is a good way to ensure that the chat template matches the tokens the model sees during training. We recommend that you apply the chat template as a preprocessing step for your dataset. After this, you can simply continue like any other language model training task. When training, you should usually set `add_generation_prompt=False`, because the added tokens to prompt an assistant response will not be helpful during training. Let's see an example: ```python from transformers import AutoTokenizer from datasets import Dataset tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta") chat1 = [ {"role": "user", "content": "Which is bigger, the moon or the sun?"}, {"role": "assistant", "content": "The sun."} ] chat2 = [ {"role": "user", "content": "Which is bigger, a virus or a bacterium?"}, {"role": "assistant", "content": "A bacterium."} ] dataset = Dataset.from_dict({"chat": [chat1, chat2]}) dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)}) print(dataset['formatted_chat'][0]) ``` And we get: ```text <|user|> Which is bigger, the moon or the sun? <|assistant|> The sun. ``` From here, just continue training like you would with a standard language modelling task, using the `formatted_chat` column. By default, some tokenizers add special tokens like `` and `` to text they tokenize. Chat templates should already include all the special tokens they need, and so additional special tokens will often be incorrect or duplicated, which will hurt model performance. Therefore, if you format text with `apply_chat_template(tokenize=False)`, you should set the argument `add_special_tokens=False` when you tokenize that text later. If you use `apply_chat_template(tokenize=True)`, you don't need to worry about this! ## Advanced: Extra inputs to chat templates The only argument that `apply_chat_template` requires is `messages`. However, you can pass any keyword argument to `apply_chat_template` and it will be accessible inside the template. This gives you a lot of freedom to use chat templates for many things. There are no restrictions on the names or the format of these arguments - you can pass strings, lists, dicts or whatever else you want. That said, there are some common use-cases for these extra arguments, such as passing tools for function calling, or documents for retrieval-augmented generation. In these common cases, we have some opinionated recommendations about what the names and formats of these arguments should be, which are described in the sections below. We encourage model authors to make their chat templates compatible with this format, to make it easy to transfer tool-calling code between models. ## Advanced: Tool use / function calling "Tool use" LLMs can choose to call functions as external tools before generating an answer. When passing tools to a tool-use model, you can simply pass a list of functions to the `tools` argument: ```python import datetime def current_time(): """Get the current local time as a string.""" return str(datetime.now()) def multiply(a: float, b: float): """ A function that multiplies two numbers Args: a: The first number to multiply b: The second number to multiply """ return a * b tools = [current_time, multiply] model_input = tokenizer.apply_chat_template( messages, tools=tools ) ``` In order for this to work correctly, you should write your functions in the format above, so that they can be parsed correctly as tools. Specifically, you should follow these rules: - The function should have a descriptive name - Every argument must have a type hint - The function must have a docstring in the standard Google style (in other words, an initial function description followed by an `Args:` block that describes the arguments, unless the function does not have any arguments. - Do not include types in the `Args:` block. In other words, write `a: The first number to multiply`, not `a (int): The first number to multiply`. Type hints should go in the function header instead. - The function can have a return type and a `Returns:` block in the docstring. However, these are optional because most tool-use models ignore them. ### Passing tool results to the model The sample code above is enough to list the available tools for your model, but what happens if it wants to actually use one? If that happens, you should: 1. Parse the model's output to get the tool name(s) and arguments. 2. Add the model's tool call(s) to the conversation. 3. Call the corresponding function(s) with those arguments. 4. Add the result(s) to the conversation ### A complete tool use example Let's walk through a tool use example, step by step. For this example, we will use an 8B `Hermes-2-Pro` model, as it is one of the highest-performing tool-use models in its size category at the time of writing. If you have the memory, you can consider using a larger model instead like [Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01) or [Mixtral-8x22B](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1), both of which also support tool use and offer even stronger performance. First, let's load our model and tokenizer: ```python import torch from transformers import AutoModelForCausalLM, AutoTokenizer checkpoint = "NousResearch/Hermes-2-Pro-Llama-3-8B" tokenizer = AutoTokenizer.from_pretrained(checkpoint) model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto") ``` Next, let's define a list of tools: ```python def get_current_temperature(location: str, unit: str) -> float: """ Get the current temperature at a location. Args: location: The location to get the temperature for, in the format "City, Country" unit: The unit to return the temperature in. (choices: ["celsius", "fahrenheit"]) Returns: The current temperature at the specified location in the specified units, as a float. """ return 22. # A real function should probably actually get the temperature! def get_current_wind_speed(location: str) -> float: """ Get the current wind speed in km/h at a given location. Args: location: The location to get the temperature for, in the format "City, Country" Returns: The current wind speed at the given location in km/h, as a float. """ return 6. # A real function should probably actually get the wind speed! tools = [get_current_temperature, get_current_wind_speed] ``` Now, let's set up a conversation for our bot: ```python messages = [ {"role": "system", "content": "You are a bot that responds to weather queries. You should reply with the unit used in the queried location."}, {"role": "user", "content": "Hey, what's the temperature in Paris right now?"} ] ``` Now, let's apply the chat template and generate a response: ```python inputs = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt") inputs = {k: v.to(model.device) for k, v in inputs.items()} out = model.generate(**inputs, max_new_tokens=128) print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):])) ``` And we get: ```text {"arguments": {"location": "Paris, France", "unit": "celsius"}, "name": "get_current_temperature"} <|im_end|> ``` The model has called the function with valid arguments, in the format requested by the function docstring. It has inferred that we're most likely referring to the Paris in France, and it remembered that, as the home of SI units, the temperature in France should certainly be displayed in Celsius. The output format above is specific to the `Hermes-2-Pro` model we're using in this example. Other models may emit different tool call formats, and you may need to do some manual parsing at this step. For example, `Llama-3.1` models will emit slightly different JSON, with `parameters` instead of `arguments`. Regardless of the format the model outputs, you should add the tool call to the conversation in the format below, with `tool_calls`, `function` and `arguments` keys. Next, let's append the model's tool call to the conversation. ```python tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}} messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]}) ``` If you're familiar with the OpenAI API, you should pay attention to an important difference here - the `tool_call` is a dict, but in the OpenAI API it's a JSON string. Passing a string may cause errors or strange model behaviour! Now that we've added the tool call to the conversation, we can call the function and append the result to the conversation. Since we're just using a dummy function for this example that always returns 22.0, we can just append that result directly. ```python messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"}) ``` Some model architectures, notably Mistral/Mixtral, also require a `tool_call_id` here, which should be 9 randomly-generated alphanumeric characters, and assigned to the `id` key of the tool call dictionary. The same key should also be assigned to the `tool_call_id` key of the tool response dictionary below, so that tool calls can be matched to tool responses. So, for Mistral/Mixtral models, the code above would be: ```python tool_call_id = "9Ae3bDc2F" # Random ID, 9 alphanumeric characters tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}} messages.append({"role": "assistant", "tool_calls": [{"type": "function", "id": tool_call_id, "function": tool_call}]}) ``` and ```python messages.append({"role": "tool", "tool_call_id": tool_call_id, "name": "get_current_temperature", "content": "22.0"}) ``` Finally, let's let the assistant read the function outputs and continue chatting with the user: ```python inputs = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt") inputs = {k: v.to(model.device) for k, v in inputs.items()} out = model.generate(**inputs, max_new_tokens=128) print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):])) ``` And we get: ```text The current temperature in Paris, France is 22.0 ยฐ Celsius.<|im_end|> ``` Although this was a simple demo with dummy tools and a single call, the same technique works with multiple real tools and longer conversations. This can be a powerful way to extend the capabilities of conversational agents with real-time information, computational tools like calculators, or access to large databases. ### Understanding tool schemas Each function you pass to the `tools` argument of `apply_chat_template` is converted into a [JSON schema](https://json-schema.org/learn/getting-started-step-by-step). These schemas are then passed to the model chat template. In other words, tool-use models do not see your functions directly, and they never see the actual code inside them. What they care about is the function **definitions** and the **arguments** they need to pass to them - they care about what the tools do and how to use them, not how they work! It is up to you to read their outputs, detect if they have requested to use a tool, pass their arguments to the tool function, and return the response in the chat. Generating JSON schemas to pass to the template should be automatic and invisible as long as your functions follow the specification above, but if you encounter problems, or you simply want more control over the conversion, you can handle the conversion manually. Here is an example of a manual schema conversion. ```python from transformers.utils import get_json_schema def multiply(a: float, b: float): """ A function that multiplies two numbers Args: a: The first number to multiply b: The second number to multiply """ return a * b schema = get_json_schema(multiply) print(schema) ``` This will yield: ```json { "type": "function", "function": { "name": "multiply", "description": "A function that multiplies two numbers", "parameters": { "type": "object", "properties": { "a": { "type": "number", "description": "The first number to multiply" }, "b": { "type": "number", "description": "The second number to multiply" } }, "required": ["a", "b"] } } } ``` If you wish, you can edit these schemas, or even write them from scratch yourself without using `get_json_schema` at all. JSON schemas can be passed directly to the `tools` argument of `apply_chat_template` - this gives you a lot of power to define precise schemas for more complex functions. Be careful, though - the more complex your schemas, the more likely the model is to get confused when dealing with them! We recommend simple function signatures where possible, keeping arguments (and especially complex, nested arguments) to a minimum. Here is an example of defining schemas by hand, and passing them directly to `apply_chat_template`: ```python # A simple function that takes no arguments current_time = { "type": "function", "function": { "name": "current_time", "description": "Get the current local time as a string.", "parameters": { 'type': 'object', 'properties': {} } } } # A more complete function that takes two numerical arguments multiply = { 'type': 'function', 'function': { 'name': 'multiply', 'description': 'A function that multiplies two numbers', 'parameters': { 'type': 'object', 'properties': { 'a': { 'type': 'number', 'description': 'The first number to multiply' }, 'b': { 'type': 'number', 'description': 'The second number to multiply' } }, 'required': ['a', 'b'] } } } model_input = tokenizer.apply_chat_template( messages, tools = [current_time, multiply] ) ``` ## Advanced: Retrieval-augmented generation "Retrieval-augmented generation" or "RAG" LLMs can search a corpus of documents for information before responding to a query. This allows models to vastly expand their knowledge base beyond their limited context size. Our recommendation for RAG models is that their template should accept a `documents` argument. This should be a list of documents, where each "document" is a single dict with `title` and `contents` keys, both of which are strings. Because this format is much simpler than the JSON schemas used for tools, no helper functions are necessary. Here's an example of a RAG template in action: ```python from transformers import AutoTokenizer, AutoModelForCausalLM # Load the model and tokenizer model_id = "CohereForAI/c4ai-command-r-v01-4bit" tokenizer = AutoTokenizer.from_pretrained(model_id) model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto") device = model.device # Get the device the model is loaded on # Define conversation input conversation = [ {"role": "user", "content": "What has Man always dreamed of?"} ] # Define documents for retrieval-based generation documents = [ { "title": "The Moon: Our Age-Old Foe", "text": "Man has always dreamed of destroying the moon. In this essay, I shall..." }, { "title": "The Sun: Our Age-Old Friend", "text": "Although often underappreciated, the sun provides several notable benefits..." } ] # Tokenize conversation and documents using a RAG template, returning PyTorch tensors. input_ids = tokenizer.apply_chat_template( conversation=conversation, documents=documents, chat_template="rag", tokenize=True, add_generation_prompt=True, return_tensors="pt").to(device) # Generate a response gen_tokens = model.generate( input_ids, max_new_tokens=100, do_sample=True, temperature=0.3, ) # Decode and print the generated text along with generation prompt gen_text = tokenizer.decode(gen_tokens[0]) print(gen_text) ``` The `documents` input for retrieval-augmented generation is not widely supported, and many models have chat templates which simply ignore this input. To verify if a model supports the `documents` input, you can read its model card, or `print(tokenizer.chat_template)` to see if the `documents` key is used anywhere. One model class that does support it, though, is Cohere's [Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-08-2024) and [Command-R+](https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024), through their `rag` chat template. You can see additional examples of grounded generation using this feature in their model cards. ## Advanced: How do chat templates work? The chat template for a model is stored on the `tokenizer.chat_template` attribute. If no chat template is set, the default template for that model class is used instead. Let's take a look at a `Zephyr` chat template, though note this one is a little simplified from the actual one! ``` {%- for message in messages %} {{- '<|' + message['role'] + |>\n' }} {{- message['content'] + eos_token }} {%- endfor %} {%- if add_generation_prompt %} {{- '<|assistant|>\n' }} {%- endif %} ``` If you've never seen one of these before, this is a [Jinja template](https://jinja.palletsprojects.com/en/3.1.x/templates/). Jinja is a templating language that allows you to write simple code that generates text. In many ways, the code and syntax resembles Python. In pure Python, this template would look something like this: ```python for message in messages: print(f'<|{message["role"]}|>') print(message['content'] + eos_token) if add_generation_prompt: print('<|assistant|>') ``` Effectively, the template does three things: 1. For each message, print the role enclosed in `<|` and `|>`, like `<|user|>` or `<|assistant|>`. 2. Next, print the content of the message, followed by the end-of-sequence token. 3. Finally, if `add_generation_prompt` is set, print the assistant token, so that the model knows to start generating an assistant response. This is a pretty simple template but Jinja gives you a lot of flexibility to do more complex things! Let's see a Jinja template that can format inputs similarly to the way LLaMA formats them (note that the real LLaMA template includes handling for default system messages and slightly different system message handling in general - don't use this one in your actual code!) ``` {%- for message in messages %} {%- if message['role'] == 'user' %} {{- bos_token + '[INST] ' + message['content'] + ' [/INST]' }} {%- elif message['role'] == 'system' %} {{- '<>\\n' + message['content'] + '\\n<>\\n\\n' }} {%- elif message['role'] == 'assistant' %} {{- ' ' + message['content'] + ' ' + eos_token }} {%- endif %} {%- endfor %} ``` Hopefully if you stare at this for a little bit you can see what this template is doing - it adds specific tokens like `[INST]` and `[/INST]` based on the role of each message. User, assistant and system messages are clearly distinguishable to the model because of the tokens they're wrapped in. ## Advanced: Adding and editing chat templates ### How do I create a chat template? Simple, just write a jinja template and set `tokenizer.chat_template`. You may find it easier to start with an existing template from another model and simply edit it for your needs! For example, we could take the LLaMA template above and add "[ASST]" and "[/ASST]" to assistant messages: ``` {%- for message in messages %} {%- if message['role'] == 'user' %} {{- bos_token + '[INST] ' + message['content'].strip() + ' [/INST]' }} {%- elif message['role'] == 'system' %} {{- '<>\\n' + message['content'].strip() + '\\n<>\\n\\n' }} {%- elif message['role'] == 'assistant' %} {{- '[ASST] ' + message['content'] + ' [/ASST]' + eos_token }} {%- endif %} {%- endfor %} ``` Now, simply set the `tokenizer.chat_template` attribute. Next time you use `apply_chat_template()`, it will use your new template! This attribute will be saved in the `tokenizer_config.json` file, so you can use `push_to_hub()` to upload your new template to the Hub and make sure everyone's using the right template for your model! ```python template = tokenizer.chat_template template = template.replace("SYS", "SYSTEM") # Change the system token tokenizer.chat_template = template # Set the new template tokenizer.push_to_hub("model_name") # Upload your new template to the Hub! ``` The method `apply_chat_template()` which uses your chat template is called by the `TextGenerationPipeline` class, so once you set the correct chat template, your model will automatically become compatible with `TextGenerationPipeline`. If you're fine-tuning a model for chat, in addition to setting a chat template, you should probably add any new chat control tokens as special tokens in the tokenizer. Special tokens are never split, ensuring that your control tokens are always handled as single tokens rather than being tokenized in pieces. You should also set the tokenizer's `eos_token` attribute to the token that marks the end of assistant generations in your template. This will ensure that text generation tools can correctly figure out when to stop generating text. ### Why do some models have multiple templates? Some models use different templates for different use cases. For example, they might use one template for normal chat and another for tool-use, or retrieval-augmented generation. In these cases, `tokenizer.chat_template` is a dictionary. This can cause some confusion, and where possible, we recommend using a single template for all use-cases. You can use Jinja statements like `if tools is defined` and `{% macro %}` definitions to easily wrap multiple code paths in a single template. When a tokenizer has multiple templates, `tokenizer.chat_template` will be a `dict`, where each key is the name of a template. The `apply_chat_template` method has special handling for certain template names: Specifically, it will look for a template named `default` in most cases, and will raise an error if it can't find one. However, if a template named `tool_use` exists when the user has passed a `tools` argument, it will use that instead. To access templates with other names, pass the name of the template you want to the `chat_template` argument of `apply_chat_template()`. We find that this can be a bit confusing for users, though - so if you're writing a template yourself, we recommend trying to put it all in a single template where possible! ### What template should I use? When setting the template for a model that's already been trained for chat, you should ensure that the template exactly matches the message formatting that the model saw during training, or else you will probably experience performance degradation. This is true even if you're training the model further - you will probably get the best performance if you keep the chat tokens constant. This is very analogous to tokenization - you generally get the best performance for inference or fine-tuning when you precisely match the tokenization used during training. If you're training a model from scratch, or fine-tuning a base language model for chat, on the other hand, you have a lot of freedom to choose an appropriate template! LLMs are smart enough to learn to handle lots of different input formats. One popular choice is the `ChatML` format, and this is a good, flexible choice for many use-cases. It looks like this: ``` {%- for message in messages %} {{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }} {%- endfor %} ``` If you like this one, here it is in one-liner form, ready to copy into your code. The one-liner also includes handy support for [generation prompts](#what-are-generation-prompts), but note that it doesn't add BOS or EOS tokens! If your model expects those, they won't be added automatically by `apply_chat_template` - in other words, the text will be tokenized with `add_special_tokens=False`. This is to avoid potential conflicts between the template and the `add_special_tokens` logic. If your model expects special tokens, make sure to add them to the template! ```python tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}" ``` This template wraps each message in `<|im_start|>` and `<|im_end|>` tokens, and simply writes the role as a string, which allows for flexibility in the roles you train with. The output looks like this: ```text <|im_start|>system You are a helpful chatbot that will do its best not to say anything so stupid that people tweet about it.<|im_end|> <|im_start|>user How are you?<|im_end|> <|im_start|>assistant I'm doing great!<|im_end|> ``` The "user", "system" and "assistant" roles are the standard for chat, and we recommend using them when it makes sense, particularly if you want your model to operate well with `TextGenerationPipeline`. However, you are not limited to these roles - templating is extremely flexible, and any string can be a role. ### I want to add some chat templates! How should I get started? If you have any chat models, you should set their `tokenizer.chat_template` attribute and test it using `apply_chat_template()`, then push the updated tokenizer to the Hub. This applies even if you're not the model owner - if you're using a model with an empty chat template, or one that's still using the default class template, please open a [pull request](https://huggingface.co/docs/hub/repositories-pull-requests-discussions) to the model repository so that this attribute can be set properly! Once the attribute is set, that's it, you're done! `tokenizer.apply_chat_template` will now work correctly for that model, which means it is also automatically supported in places like `TextGenerationPipeline`! By ensuring that models have this attribute, we can make sure that the whole community gets to use the full power of open-source models. Formatting mismatches have been haunting the field and silently harming performance for too long - it's time to put an end to them! ## Advanced: Template writing tips The easiest way to get started with writing Jinja templates is to take a look at some existing ones. You can use `print(tokenizer.chat_template)` for any chat model to see what template it's using. In general, models that support tool use have much more complex templates than other models - so when you're just getting started, they're probably a bad example to learn from! You can also take a look at the [Jinja documentation](https://jinja.palletsprojects.com/en/3.1.x/templates/#synopsis) for details of general Jinja formatting and syntax. Jinja templates in `transformers` are identical to Jinja templates elsewhere. The main thing to know is that the conversation history will be accessible inside your template as a variable called `messages`. You will be able to access `messages` in your template just like you can in Python, which means you can loop over it with `{% for message in messages %}` or access individual messages with `{{ messages[0] }}`, for example. You can also use the following tips to write clean, efficient Jinja templates: ### Trimming whitespace By default, Jinja will print any whitespace that comes before or after a block. This can be a problem for chat templates, which generally want to be very precise with whitespace! To avoid this, we strongly recommend writing your templates like this: ``` {%- for message in messages %} {{- message['role'] + message['content'] }} {%- endfor %} ``` rather than like this: ``` {% for message in messages %} {{ message['role'] + message['content'] }} {% endfor %} ``` Adding `-` will strip any whitespace that comes before the block. The second example looks innocent, but the newline and indentation may end up being included in the output, which is probably not what you want! ### Special variables Inside your template, you will have access several special variables. The most important of these is `messages`, which contains the chat history as a list of message dicts. However, there are several others. Not every variable will be used in every template. The most common other variables are: - `tools` contains a list of tools in JSON schema format. Will be `None` or undefined if no tools are passed. - `documents` contains a list of documents in the format `{"title": "Title", "contents": "Contents"}`, used for retrieval-augmented generation. Will be `None` or undefined if no documents are passed. - `add_generation_prompt` is a bool that is `True` if the user has requested a generation prompt, and `False` otherwise. If this is set, your template should add the header for an assistant message to the end of the conversation. If your model doesn't have a specific header for assistant messages, you can ignore this flag. - **Special tokens** like `bos_token` and `eos_token`. These are extracted from `tokenizer.special_tokens_map`. The exact tokens available inside each template will differ depending on the parent tokenizer. You can actually pass any `kwarg` to `apply_chat_template`, and it will be accessible inside the template as a variable. In general, we recommend trying to stick to the core variables above, as it will make your model harder to use if users have to write custom code to pass model-specific `kwargs`. However, we're aware that this field moves quickly, so if you have a new use-case that doesn't fit in the core API, feel free to use a new `kwarg` for it! If a new `kwarg` becomes common we may promote it into the core API and create a standard, documented format for it. ### Callable functions There is also a short list of callable functions available to you inside your templates. These are: - `raise_exception(msg)`: Raises a `TemplateException`. This is useful for debugging, and for telling users when they're doing something that your template doesn't support. - `strftime_now(format_str)`: Equivalent to `datetime.now().strftime(format_str)` in Python. This is used for getting the current date/time in a specific format, which is sometimes included in system messages. ### Compatibility with non-Python Jinja There are multiple implementations of Jinja in various languages. They generally have the same syntax, but a key difference is that when you're writing a template in Python you can use Python methods, such as `.lower()` on strings or `.items()` on dicts. This will break if someone tries to use your template on a non-Python implementation of Jinja. Non-Python implementations are particularly common in deployment environments, where JS and Rust are very popular. Don't panic, though! There are a few easy changes you can make to your templates to ensure they're compatible across all implementations of Jinja: - Replace Python methods with Jinja filters. These usually have the same name, for example `string.lower()` becomes `string|lower`, and `dict.items()` becomes `dict|items`. One notable change is that `string.strip()` becomes `string|trim`. See the [list of built-in filters](https://jinja.palletsprojects.com/en/3.1.x/templates/#builtin-filters) in the Jinja documentation for more. - Replace `True`, `False` and `None`, which are Python-specific, with `true`, `false` and `none`. - Directly rendering a dict or list may give different results in other implementations (for example, string entries might change from single-quoted to double-quoted). Adding the `tojson` filter can help to ensure consistency here. ### Writing generation prompts We mentioned above that `add_generation_prompt` is a special variable that will be accessible inside your template, and is controlled by the user setting the `add_generation_prompt` flag. If your model expects a header for assistant messages, then your template must support adding the header when `add_generation_prompt` is set. Here is an example of a template that formats messages ChatML-style, with generation prompt support: ```text {{- bos_token }} {%- for message in messages %} {{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }} {%- endfor %} {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n' }} {%- endif %} ``` The exact content of the assistant header will depend on your specific model, but it should always be **the string that represents the start of an assistant message**, so that if the user applies your template with `add_generation_prompt=True` and then generates text, the model will write an assistant response. Also note that some models do not need a generation prompt, because assistant messages always begin immediately after user messages. This is particularly common for LLaMA and Mistral models, where assistant messages begin immediately after the `[/INST]` token that ends user messages. In these cases, the template can ignore the `add_generation_prompt` flag. Generation prompts are important! If your model requires a generation prompt but it is not set in the template, then model generations will likely be severely degraded, or the model may display unusual behaviour like continuing the final user message! ### Writing and debugging larger templates When this feature was introduced, most templates were quite small, the Jinja equivalent of a "one-liner" script. However, with new models and features like tool-use and RAG, some templates can be 100 lines long or more. When writing templates like these, it's a good idea to write them in a separate file, using a text editor. You can easily extract a chat template to a file: ```python open("template.jinja", "w").write(tokenizer.chat_template) ``` Or load the edited template back into the tokenizer: ```python tokenizer.chat_template = open("template.jinja").read() ``` As an added bonus, when you write a long, multi-line template in a separate file, line numbers in that file will exactly correspond to line numbers in template parsing or execution errors. This will make it much easier to identify the source of issues. ### Writing templates for tools Although chat templates do not enforce a specific API for tools (or for anything, really), we recommend template authors try to stick to a standard API where possible. The whole point of chat templates is to allow code to be transferable across models, so deviating from the standard tools API means users will have to write custom code to use tools with your model. Sometimes it's unavoidable, but often with clever templating you can make the standard API work! Below, we'll list the elements of the standard API, and give tips on writing templates that will work well with it. #### Tool definitions Your template should expect that the variable `tools` will either be null (if no tools are passed), or is a list of JSON schema dicts. Our chat template methods allow users to pass tools as either JSON schema or Python functions, but when functions are passed, we automatically generate JSON schema and pass that to your template. As a result, the `tools` variable that your template receives will always be a list of JSON schema. Here is a sample tool JSON schema: ```json { "type": "function", "function": { "name": "multiply", "description": "A function that multiplies two numbers", "parameters": { "type": "object", "properties": { "a": { "type": "number", "description": "The first number to multiply" }, "b": { "type": "number", "description": "The second number to multiply" } }, "required": ["a", "b"] } } } ``` And here is some example code for handling tools in your chat template. Remember, this is just an example for a specific format - your model will probably need different formatting! ```text {%- if tools %} {%- for tool in tools %} {{- '' + tool['function']['name'] + '\n' }} {%- for argument in tool['function']['parameters']['properties'] %} {{- argument + ': ' + tool['function']['parameters']['properties'][argument]['description'] + '\n' }} {%- endfor %} {{- '\n' }} {%- endif %} {%- endif %} ``` The specific tokens and tool descriptions your template renders should of course be chosen to match the ones your model was trained with. There is no requirement that your **model** understands JSON schema input, only that your template can translate JSON schema into your model's format. For example, [Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024) was trained with tools defined using Python function headers, but the Command-R tool template accepts JSON schema, converts types internally and renders the input tools as Python headers. You can do a lot with templates! #### Tool calls Tool calls, if present, will be a list attached to a message with the "assistant" role. Note that `tool_calls` is always a list, even though most tool-calling models only support single tool calls at a time, which means the list will usually only have a single element. Here is a sample message dict containing a tool call: ```json { "role": "assistant", "tool_calls": [ { "type": "function", "function": { "name": "multiply", "arguments": { "a": 5, "b": 6 } } } ] } ``` And a common pattern for handling them would be something like this: ```text {%- if message['role'] == 'assistant' and 'tool_calls' in message %} {%- for tool_call in message['tool_calls'] %} {{- '' + tool_call['function']['name'] + '\n' + tool_call['function']['arguments']|tojson + '\n' }} {%- endif %} {%- endfor %} {%- endif %} ``` Again, you should render the tool call with the formatting and special tokens that your model expects. #### Tool responses Tool responses have a simple format: They are a message dict with the "tool" role, a "name" key giving the name of the called function, and a "content" key containing the result of the tool call. Here is a sample tool response: ```json { "role": "tool", "name": "multiply", "content": "30" } ``` You don't need to use all of the keys in the tool response. For example, if your model doesn't expect the function name to be included in the tool response, then rendering it can be as simple as: ```text {%- if message['role'] == 'tool' %} {{- "" + message['content'] + "" }} {%- endif %} ``` Again, remember that the actual formatting and special tokens are model-specific - you should take a lot of care to ensure that tokens, whitespace and everything else exactly match the format your model was trained with! # Tiktoken and interaction with Transformers Support for tiktoken model files is seamlessly integrated in ๐Ÿค— transformers when loading models `from_pretrained` with a `tokenizer.model` tiktoken file on the Hub, which is automatically converted into our [fast tokenizer](https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast). ### Known models that were released with a `tiktoken.model`: - gpt2 - llama3 ## Example usage In order to load `tiktoken` files in `transformers`, ensure that the `tokenizer.model` file is a tiktoken file and it will automatically be loaded when loading `from_pretrained`. Here is how one would load a tokenizer and a model, which can be loaded from the exact same file: ```py from transformers import AutoTokenizer model_id = "meta-llama/Meta-Llama-3-8B-Instruct" tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="original") ``` # Testing Let's take a look at how ๐Ÿค— Transformers models are tested and how you can write new tests and improve the existing ones. There are 2 test suites in the repository: 1. `tests` -- tests for the general API 2. `examples` -- tests primarily for various applications that aren't part of the API ## How transformers are tested 1. Once a PR is submitted it gets tested with 9 CircleCi jobs. Every new commit to that PR gets retested. These jobs are defined in this [config file](https://github.com/huggingface/transformers/tree/main/.circleci/config.yml), so that if needed you can reproduce the same environment on your machine. These CI jobs don't run `@slow` tests. 2. There are 3 jobs run by [github actions](https://github.com/huggingface/transformers/actions): - [torch hub integration](https://github.com/huggingface/transformers/tree/main/.github/workflows/github-torch-hub.yml): checks whether torch hub integration works. - [self-hosted (push)](https://github.com/huggingface/transformers/tree/main/.github/workflows/self-push.yml): runs fast tests on GPU only on commits on `main`. It only runs if a commit on `main` has updated the code in one of the following folders: `src`, `tests`, `.github` (to prevent running on added model cards, notebooks, etc.) - [self-hosted runner](https://github.com/huggingface/transformers/tree/main/.github/workflows/self-scheduled.yml): runs normal and slow tests on GPU in `tests` and `examples`: ```bash RUN_SLOW=1 pytest tests/ RUN_SLOW=1 pytest examples/ ``` The results can be observed [here](https://github.com/huggingface/transformers/actions). ## Running tests ### Choosing which tests to run This document goes into many details of how tests can be run. If after reading everything, you need even more details you will find them [here](https://docs.pytest.org/en/latest/usage.html). Here are some most useful ways of running tests. Run all: ```console pytest ``` or: ```bash make test ``` Note that the latter is defined as: ```bash python -m pytest -n auto --dist=loadfile -s -v ./tests/ ``` which tells pytest to: - run as many test processes as they are CPU cores (which could be too many if you don't have a ton of RAM!) - ensure that all tests from the same file will be run by the same test process - do not capture output - run in verbose mode ### Getting the list of all tests All tests of the test suite: ```bash pytest --collect-only -q ``` All tests of a given test file: ```bash pytest tests/test_optimization.py --collect-only -q ``` ### Run a specific test module To run an individual test module: ```bash pytest tests/utils/test_logging.py ``` ### Run specific tests Since unittest is used inside most of the tests, to run specific subtests you need to know the name of the unittest class containing those tests. For example, it could be: ```bash pytest tests/test_optimization.py::OptimizationTest::test_adam_w ``` Here: - `tests/test_optimization.py` - the file with tests - `OptimizationTest` - the name of the class - `test_adam_w` - the name of the specific test function If the file contains multiple classes, you can choose to run only tests of a given class. For example: ```bash pytest tests/test_optimization.py::OptimizationTest ``` will run all the tests inside that class. As mentioned earlier you can see what tests are contained inside the `OptimizationTest` class by running: ```bash pytest tests/test_optimization.py::OptimizationTest --collect-only -q ``` You can run tests by keyword expressions. To run only tests whose name contains `adam`: ```bash pytest -k adam tests/test_optimization.py ``` Logical `and` and `or` can be used to indicate whether all keywords should match or either. `not` can be used to negate. To run all tests except those whose name contains `adam`: ```bash pytest -k "not adam" tests/test_optimization.py ``` And you can combine the two patterns in one: ```bash pytest -k "ada and not adam" tests/test_optimization.py ``` For example to run both `test_adafactor` and `test_adam_w` you can use: ```bash pytest -k "test_adafactor or test_adam_w" tests/test_optimization.py ``` Note that we use `or` here, since we want either of the keywords to match to include both. If you want to include only tests that include both patterns, `and` is to be used: ```bash pytest -k "test and ada" tests/test_optimization.py ``` ### Run `accelerate` tests Sometimes you need to run `accelerate` tests on your models. For that you can just add `-m accelerate_tests` to your command, if let's say you want to run these tests on `OPT` run: ```bash RUN_SLOW=1 pytest -m accelerate_tests tests/models/opt/test_modeling_opt.py ``` ### Run documentation tests In order to test whether the documentation examples are correct, you should check that the `doctests` are passing. As an example, let's use [`WhisperModel.forward`'s docstring](https://github.com/huggingface/transformers/blob/1124d95dbb1a3512d3e80791d73d0f541d1d7e9f/src/transformers/models/whisper/modeling_whisper.py#L1591-L1609) ```python r""" Returns: Example: ```python >>> import torch >>> from transformers import WhisperModel, WhisperFeatureExtractor >>> from datasets import load_dataset >>> model = WhisperModel.from_pretrained("openai/whisper-base") >>> feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base") >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") >>> inputs = feature_extractor(ds[0]["audio"]["array"], return_tensors="pt") >>> input_features = inputs.input_features >>> decoder_input_ids = torch.tensor([[1, 1]]) * model.config.decoder_start_token_id >>> last_hidden_state = model(input_features, decoder_input_ids=decoder_input_ids).last_hidden_state >>> list(last_hidden_state.shape) [1, 2, 512] ```""" ``` Just run the following line to automatically test every docstring example in the desired file: ```bash pytest --doctest-modules ``` If the file has a markdown extention, you should add the `--doctest-glob="*.md"` argument. ### Run only modified tests You can run the tests related to the unstaged files or the current branch (according to Git) by using [pytest-picked](https://github.com/anapaulagomes/pytest-picked). This is a great way of quickly testing your changes didn't break anything, since it won't run the tests related to files you didn't touch. ```bash pip install pytest-picked ``` ```bash pytest --picked ``` All tests will be run from files and folders which are modified, but not yet committed. ### Automatically rerun failed tests on source modification [pytest-xdist](https://github.com/pytest-dev/pytest-xdist) provides a very useful feature of detecting all failed tests, and then waiting for you to modify files and continuously re-rerun those failing tests until they pass while you fix them. So that you don't need to re start pytest after you made the fix. This is repeated until all tests pass after which again a full run is performed. ```bash pip install pytest-xdist ``` To enter the mode: `pytest -f` or `pytest --looponfail` File changes are detected by looking at `looponfailroots` root directories and all of their contents (recursively). If the default for this value does not work for you, you can change it in your project by setting a configuration option in `setup.cfg`: ```ini [tool:pytest] looponfailroots = transformers tests ``` or `pytest.ini`/``tox.ini`` files: ```ini [pytest] looponfailroots = transformers tests ``` This would lead to only looking for file changes in the respective directories, specified relatively to the ini-fileโ€™s directory. [pytest-watch](https://github.com/joeyespo/pytest-watch) is an alternative implementation of this functionality. ### Skip a test module If you want to run all test modules, except a few you can exclude them by giving an explicit list of tests to run. For example, to run all except `test_modeling_*.py` tests: ```bash pytest *ls -1 tests/*py | grep -v test_modeling* ``` ### Clearing state CI builds and when isolation is important (against speed), cache should be cleared: ```bash pytest --cache-clear tests ``` ### Running tests in parallel As mentioned earlier `make test` runs tests in parallel via `pytest-xdist` plugin (`-n X` argument, e.g. `-n 2` to run 2 parallel jobs). `pytest-xdist`'s `--dist=` option allows one to control how the tests are grouped. `--dist=loadfile` puts the tests located in one file onto the same process. Since the order of executed tests is different and unpredictable, if running the test suite with `pytest-xdist` produces failures (meaning we have some undetected coupled tests), use [pytest-replay](https://github.com/ESSS/pytest-replay) to replay the tests in the same order, which should help with then somehow reducing that failing sequence to a minimum. ### Test order and repetition It's good to repeat the tests several times, in sequence, randomly, or in sets, to detect any potential inter-dependency and state-related bugs (tear down). And the straightforward multiple repetition is just good to detect some problems that get uncovered by randomness of DL. #### Repeat tests - [pytest-flakefinder](https://github.com/dropbox/pytest-flakefinder): ```bash pip install pytest-flakefinder ``` And then run every test multiple times (50 by default): ```bash pytest --flake-finder --flake-runs=5 tests/test_failing_test.py ``` This plugin doesn't work with `-n` flag from `pytest-xdist`. There is another plugin `pytest-repeat`, but it doesn't work with `unittest`. #### Run tests in a random order ```bash pip install pytest-random-order ``` Important: the presence of `pytest-random-order` will automatically randomize tests, no configuration change or command line options is required. As explained earlier this allows detection of coupled tests - where one test's state affects the state of another. When `pytest-random-order` is installed it will print the random seed it used for that session, e.g: ```bash pytest tests [...] Using --random-order-bucket=module Using --random-order-seed=573663 ``` So that if the given particular sequence fails, you can reproduce it by adding that exact seed, e.g.: ```bash pytest --random-order-seed=573663 [...] Using --random-order-bucket=module Using --random-order-seed=573663 ``` It will only reproduce the exact order if you use the exact same list of tests (or no list at all). Once you start to manually narrowing down the list you can no longer rely on the seed, but have to list them manually in the exact order they failed and tell pytest to not randomize them instead using `--random-order-bucket=none`, e.g.: ```bash pytest --random-order-bucket=none tests/test_a.py tests/test_c.py tests/test_b.py ``` To disable the shuffling for all tests: ```bash pytest --random-order-bucket=none ``` By default `--random-order-bucket=module` is implied, which will shuffle the files on the module levels. It can also shuffle on `class`, `package`, `global` and `none` levels. For the complete details please see its [documentation](https://github.com/jbasko/pytest-random-order). Another randomization alternative is: [`pytest-randomly`](https://github.com/pytest-dev/pytest-randomly). This module has a very similar functionality/interface, but it doesn't have the bucket modes available in `pytest-random-order`. It has the same problem of imposing itself once installed. ### Look and feel variations #### pytest-sugar [pytest-sugar](https://github.com/Frozenball/pytest-sugar) is a plugin that improves the look-n-feel, adds a progressbar, and show tests that fail and the assert instantly. It gets activated automatically upon installation. ```bash pip install pytest-sugar ``` To run tests without it, run: ```bash pytest -p no:sugar ``` or uninstall it. #### Report each sub-test name and its progress For a single or a group of tests via `pytest` (after `pip install pytest-pspec`): ```bash pytest --pspec tests/test_optimization.py ``` #### Instantly shows failed tests [pytest-instafail](https://github.com/pytest-dev/pytest-instafail) shows failures and errors instantly instead of waiting until the end of test session. ```bash pip install pytest-instafail ``` ```bash pytest --instafail ``` ### To GPU or not to GPU On a GPU-enabled setup, to test in CPU-only mode add `CUDA_VISIBLE_DEVICES=""`: ```bash CUDA_VISIBLE_DEVICES="" pytest tests/utils/test_logging.py ``` or if you have multiple gpus, you can specify which one is to be used by `pytest`. For example, to use only the second gpu if you have gpus `0` and `1`, you can run: ```bash CUDA_VISIBLE_DEVICES="1" pytest tests/utils/test_logging.py ``` This is handy when you want to run different tasks on different GPUs. Some tests must be run on CPU-only, others on either CPU or GPU or TPU, yet others on multiple-GPUs. The following skip decorators are used to set the requirements of tests CPU/GPU/TPU-wise: - `require_torch` - this test will run only under torch - `require_torch_gpu` - as `require_torch` plus requires at least 1 GPU - `require_torch_multi_gpu` - as `require_torch` plus requires at least 2 GPUs - `require_torch_non_multi_gpu` - as `require_torch` plus requires 0 or 1 GPUs - `require_torch_up_to_2_gpus` - as `require_torch` plus requires 0 or 1 or 2 GPUs - `require_torch_xla` - as `require_torch` plus requires at least 1 TPU Let's depict the GPU requirements in the following table: | n gpus | decorator | |--------|--------------------------------| | `>= 0` | `@require_torch` | | `>= 1` | `@require_torch_gpu` | | `>= 2` | `@require_torch_multi_gpu` | | `< 2` | `@require_torch_non_multi_gpu` | | `< 3` | `@require_torch_up_to_2_gpus` | For example, here is a test that must be run only when there are 2 or more GPUs available and pytorch is installed: ```python no-style @require_torch_multi_gpu def test_example_with_multi_gpu(): ``` If a test requires `tensorflow` use the `require_tf` decorator. For example: ```python no-style @require_tf def test_tf_thing_with_tensorflow(): ``` These decorators can be stacked. For example, if a test is slow and requires at least one GPU under pytorch, here is how to set it up: ```python no-style @require_torch_gpu @slow def test_example_slow_on_gpu(): ``` Some decorators like `@parametrized` rewrite test names, therefore `@require_*` skip decorators have to be listed last for them to work correctly. Here is an example of the correct usage: ```python no-style @parameterized.expand(...) @require_torch_multi_gpu def test_integration_foo(): ``` This order problem doesn't exist with `@pytest.mark.parametrize`, you can put it first or last and it will still work. But it only works with non-unittests. Inside tests: - How many GPUs are available: ```python from transformers.testing_utils import get_gpu_count n_gpu = get_gpu_count() # works with torch and tf ``` ### Testing with a specific PyTorch backend or device To run the test suite on a specific torch device add `TRANSFORMERS_TEST_DEVICE="$device"` where `$device` is the target backend. For example, to test on CPU only: ```bash TRANSFORMERS_TEST_DEVICE="cpu" pytest tests/utils/test_logging.py ``` This variable is useful for testing custom or less common PyTorch backends such as `mps`, `xpu` or `npu`. It can also be used to achieve the same effect as `CUDA_VISIBLE_DEVICES` by targeting specific GPUs or testing in CPU-only mode. Certain devices will require an additional import after importing `torch` for the first time. This can be specified using the environment variable `TRANSFORMERS_TEST_BACKEND`: ```bash TRANSFORMERS_TEST_BACKEND="torch_npu" pytest tests/utils/test_logging.py ``` Alternative backends may also require the replacement of device-specific functions. For example `torch.cuda.manual_seed` may need to be replaced with a device-specific seed setter like `torch.npu.manual_seed` or `torch.xpu.manual_seed` to correctly set a random seed on the device. To specify a new backend with backend-specific device functions when running the test suite, create a Python device specification file `spec.py` in the format: ```python import torch import torch_npu # for xpu, replace it with `import intel_extension_for_pytorch` # !! Further additional imports can be added here !! # Specify the device name (eg. 'cuda', 'cpu', 'npu', 'xpu', 'mps') DEVICE_NAME = 'npu' # Specify device-specific backends to dispatch to. # If not specified, will fallback to 'default' in 'testing_utils.py` MANUAL_SEED_FN = torch.npu.manual_seed EMPTY_CACHE_FN = torch.npu.empty_cache DEVICE_COUNT_FN = torch.npu.device_count ``` This format also allows for specification of any additional imports required. To use this file to replace equivalent methods in the test suite, set the environment variable `TRANSFORMERS_TEST_DEVICE_SPEC` to the path of the spec file, e.g. `TRANSFORMERS_TEST_DEVICE_SPEC=spec.py`. Currently, only `MANUAL_SEED_FN`, `EMPTY_CACHE_FN` and `DEVICE_COUNT_FN` are supported for device-specific dispatch. ### Distributed training `pytest` can't deal with distributed training directly. If this is attempted - the sub-processes don't do the right thing and end up thinking they are `pytest` and start running the test suite in loops. It works, however, if one spawns a normal process that then spawns off multiple workers and manages the IO pipes. Here are some tests that use it: - [test_trainer_distributed.py](https://github.com/huggingface/transformers/tree/main/tests/trainer/test_trainer_distributed.py) - [test_deepspeed.py](https://github.com/huggingface/transformers/tree/main/tests/deepspeed/test_deepspeed.py) To jump right into the execution point, search for the `execute_subprocess_async` call in those tests. You will need at least 2 GPUs to see these tests in action: ```bash CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 pytest -sv tests/test_trainer_distributed.py ``` ### Output capture During test execution any output sent to `stdout` and `stderr` is captured. If a test or a setup method fails, its according captured output will usually be shown along with the failure traceback. To disable output capturing and to get the `stdout` and `stderr` normally, use `-s` or `--capture=no`: ```bash pytest -s tests/utils/test_logging.py ``` To send test results to JUnit format output: ```bash pytest tests --junitxml=result.xml ``` ### Color control To have no color (e.g., yellow on white background is not readable): ```bash pytest --color=no tests/utils/test_logging.py ``` ### Sending test report to online pastebin service Creating a URL for each test failure: ```bash pytest --pastebin=failed tests/utils/test_logging.py ``` This will submit test run information to a remote Paste service and provide a URL for each failure. You may select tests as usual or add for example -x if you only want to send one particular failure. Creating a URL for a whole test session log: ```bash pytest --pastebin=all tests/utils/test_logging.py ``` ## Writing tests ๐Ÿค— transformers tests are based on `unittest`, but run by `pytest`, so most of the time features from both systems can be used. You can read [here](https://docs.pytest.org/en/stable/unittest.html) which features are supported, but the important thing to remember is that most `pytest` fixtures don't work. Neither parametrization, but we use the module `parameterized` that works in a similar way. ### Parametrization Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within the test, but then there is no way of running that test for just one set of arguments. ```python # test_this1.py import unittest from parameterized import parameterized class TestMathUnitTest(unittest.TestCase): @parameterized.expand( [ ("negative", -1.5, -2.0), ("integer", 1, 1.0), ("large fraction", 1.6, 1), ] ) def test_floor(self, name, input, expected): assert_equal(math.floor(input), expected) ``` Now, by default this test will be run 3 times, each time with the last 3 arguments of `test_floor` being assigned the corresponding arguments in the parameter list. and you could run just the `negative` and `integer` sets of params with: ```bash pytest -k "negative and integer" tests/test_mytest.py ``` or all but `negative` sub-tests, with: ```bash pytest -k "not negative" tests/test_mytest.py ``` Besides using the `-k` filter that was just mentioned, you can find out the exact name of each sub-test and run any or all of them using their exact names. ```bash pytest test_this1.py --collect-only -q ``` and it will list: ```bash test_this1.py::TestMathUnitTest::test_floor_0_negative test_this1.py::TestMathUnitTest::test_floor_1_integer test_this1.py::TestMathUnitTest::test_floor_2_large_fraction ``` So now you can run just 2 specific sub-tests: ```bash pytest test_this1.py::TestMathUnitTest::test_floor_0_negative test_this1.py::TestMathUnitTest::test_floor_1_integer ``` The module [parameterized](https://pypi.org/project/parameterized/) which is already in the developer dependencies of `transformers` works for both: `unittests` and `pytest` tests. If, however, the test is not a `unittest`, you may use `pytest.mark.parametrize` (or you may see it being used in some existing tests, mostly under `examples`). Here is the same example, this time using `pytest`'s `parametrize` marker: ```python # test_this2.py import pytest @pytest.mark.parametrize( "name, input, expected", [ ("negative", -1.5, -2.0), ("integer", 1, 1.0), ("large fraction", 1.6, 1), ], ) def test_floor(name, input, expected): assert_equal(math.floor(input), expected) ``` Same as with `parameterized`, with `pytest.mark.parametrize` you can have a fine control over which sub-tests are run, if the `-k` filter doesn't do the job. Except, this parametrization function creates a slightly different set of names for the sub-tests. Here is what they look like: ```bash pytest test_this2.py --collect-only -q ``` and it will list: ```bash test_this2.py::test_floor[integer-1-1.0] test_this2.py::test_floor[negative--1.5--2.0] test_this2.py::test_floor[large fraction-1.6-1] ``` So now you can run just the specific test: ```bash pytest test_this2.py::test_floor[negative--1.5--2.0] test_this2.py::test_floor[integer-1-1.0] ``` as in the previous example. ### Files and directories In tests often we need to know where things are relative to the current test file, and it's not trivial since the test could be invoked from more than one directory or could reside in sub-directories with different depths. A helper class `transformers.test_utils.TestCasePlus` solves this problem by sorting out all the basic paths and provides easy accessors to them: - `pathlib` objects (all fully resolved): - `test_file_path` - the current test file path, i.e. `__file__` - `test_file_dir` - the directory containing the current test file - `tests_dir` - the directory of the `tests` test suite - `examples_dir` - the directory of the `examples` test suite - `repo_root_dir` - the directory of the repository - `src_dir` - the directory of `src` (i.e. where the `transformers` sub-dir resides) - stringified paths---same as above but these return paths as strings, rather than `pathlib` objects: - `test_file_path_str` - `test_file_dir_str` - `tests_dir_str` - `examples_dir_str` - `repo_root_dir_str` - `src_dir_str` To start using those all you need is to make sure that the test resides in a subclass of `transformers.test_utils.TestCasePlus`. For example: ```python from transformers.testing_utils import TestCasePlus class PathExampleTest(TestCasePlus): def test_something_involving_local_locations(self): data_dir = self.tests_dir / "fixtures/tests_samples/wmt_en_ro" ``` If you don't need to manipulate paths via `pathlib` or you just need a path as a string, you can always invoked `str()` on the `pathlib` object or use the accessors ending with `_str`. For example: ```python from transformers.testing_utils import TestCasePlus class PathExampleTest(TestCasePlus): def test_something_involving_stringified_locations(self): examples_dir = self.examples_dir_str ``` ### Temporary files and directories Using unique temporary files and directories are essential for parallel test running, so that the tests won't overwrite each other's data. Also we want to get the temporary files and directories removed at the end of each test that created them. Therefore, using packages like `tempfile`, which address these needs is essential. However, when debugging tests, you need to be able to see what goes into the temporary file or directory and you want to know it's exact path and not having it randomized on every test re-run. A helper class `transformers.test_utils.TestCasePlus` is best used for such purposes. It's a sub-class of `unittest.TestCase`, so we can easily inherit from it in the test modules. Here is an example of its usage: ```python from transformers.testing_utils import TestCasePlus class ExamplesTests(TestCasePlus): def test_whatever(self): tmp_dir = self.get_auto_remove_tmp_dir() ``` This code creates a unique temporary directory, and sets `tmp_dir` to its location. - Create a unique temporary dir: ```python def test_whatever(self): tmp_dir = self.get_auto_remove_tmp_dir() ``` `tmp_dir` will contain the path to the created temporary dir. It will be automatically removed at the end of the test. - Create a temporary dir of my choice, ensure it's empty before the test starts and don't empty it after the test. ```python def test_whatever(self): tmp_dir = self.get_auto_remove_tmp_dir("./xxx") ``` This is useful for debug when you want to monitor a specific directory and want to make sure the previous tests didn't leave any data in there. - You can override the default behavior by directly overriding the `before` and `after` args, leading to one of the following behaviors: - `before=True`: the temporary dir will always be cleared at the beginning of the test. - `before=False`: if the temporary dir already existed, any existing files will remain there. - `after=True`: the temporary dir will always be deleted at the end of the test. - `after=False`: the temporary dir will always be left intact at the end of the test. In order to run the equivalent of `rm -r` safely, only subdirs of the project repository checkout are allowed if an explicit `tmp_dir` is used, so that by mistake no `/tmp` or similar important part of the filesystem will get nuked. i.e. please always pass paths that start with `./`. Each test can register multiple temporary directories and they all will get auto-removed, unless requested otherwise. ### Temporary sys.path override If you need to temporary override `sys.path` to import from another test for example, you can use the `ExtendSysPath` context manager. Example: ```python import os from transformers.testing_utils import ExtendSysPath bindir = os.path.abspath(os.path.dirname(__file__)) with ExtendSysPath(f"{bindir}/.."): from test_trainer import TrainerIntegrationCommon # noqa ``` ### Skipping tests This is useful when a bug is found and a new test is written, yet the bug is not fixed yet. In order to be able to commit it to the main repository we need make sure it's skipped during `make test`. Methods: - A **skip** means that you expect your test to pass only if some conditions are met, otherwise pytest should skip running the test altogether. Common examples are skipping windows-only tests on non-windows platforms, or skipping tests that depend on an external resource which is not available at the moment (for example a database). - A **xfail** means that you expect a test to fail for some reason. A common example is a test for a feature not yet implemented, or a bug not yet fixed. When a test passes despite being expected to fail (marked with pytest.mark.xfail), itโ€™s an xpass and will be reported in the test summary. One of the important differences between the two is that `skip` doesn't run the test, and `xfail` does. So if the code that's buggy causes some bad state that will affect other tests, do not use `xfail`. #### Implementation - Here is how to skip whole test unconditionally: ```python no-style @unittest.skip(reason="this bug needs to be fixed") def test_feature_x(): ``` or via pytest: ```python no-style @pytest.mark.skip(reason="this bug needs to be fixed") ``` or the `xfail` way: ```python no-style @pytest.mark.xfail def test_feature_x(): ``` Here's how to skip a test based on internal checks within the test: ```python def test_feature_x(): if not has_something(): pytest.skip("unsupported configuration") ``` or the whole module: ```python import pytest if not pytest.config.getoption("--custom-flag"): pytest.skip("--custom-flag is missing, skipping tests", allow_module_level=True) ``` or the `xfail` way: ```python def test_feature_x(): pytest.xfail("expected to fail until bug XYZ is fixed") ``` - Here is how to skip all tests in a module if some import is missing: ```python docutils = pytest.importorskip("docutils", minversion="0.3") ``` - Skip a test based on a condition: ```python no-style @pytest.mark.skipif(sys.version_info < (3,6), reason="requires python3.6 or higher") def test_feature_x(): ``` or: ```python no-style @unittest.skipIf(torch_device == "cpu", "Can't do half precision") def test_feature_x(): ``` or skip the whole module: ```python no-style @pytest.mark.skipif(sys.platform == 'win32', reason="does not run on windows") class TestClass(): def test_feature_x(self): ``` More details, example and ways are [here](https://docs.pytest.org/en/latest/skipping.html). ### Slow tests The library of tests is ever-growing, and some of the tests take minutes to run, therefore we can't afford waiting for an hour for the test suite to complete on CI. Therefore, with some exceptions for essential tests, slow tests should be marked as in the example below: ```python no-style from transformers.testing_utils import slow @slow def test_integration_foo(): ``` Once a test is marked as `@slow`, to run such tests set `RUN_SLOW=1` env var, e.g.: ```bash RUN_SLOW=1 pytest tests ``` Some decorators like `@parameterized` rewrite test names, therefore `@slow` and the rest of the skip decorators `@require_*` have to be listed last for them to work correctly. Here is an example of the correct usage: ```python no-style @parameterized.expand(...) @slow def test_integration_foo(): ``` As explained at the beginning of this document, slow tests get to run on a scheduled basis, rather than in PRs CI checks. So it's possible that some problems will be missed during a PR submission and get merged. Such problems will get caught during the next scheduled CI job. But it also means that it's important to run the slow tests on your machine before submitting the PR. Here is a rough decision making mechanism for choosing which tests should be marked as slow: If the test is focused on one of the library's internal components (e.g., modeling files, tokenization files, pipelines), then we should run that test in the non-slow test suite. If it's focused on an other aspect of the library, such as the documentation or the examples, then we should run these tests in the slow test suite. And then, to refine this approach we should have exceptions: - All tests that need to download a heavy set of weights or a dataset that is larger than ~50MB (e.g., model or tokenizer integration tests, pipeline integration tests) should be set to slow. If you're adding a new model, you should create and upload to the hub a tiny version of it (with random weights) for integration tests. This is discussed in the following paragraphs. - All tests that need to do a training not specifically optimized to be fast should be set to slow. - We can introduce exceptions if some of these should-be-non-slow tests are excruciatingly slow, and set them to `@slow`. Auto-modeling tests, which save and load large files to disk, are a good example of tests that are marked as `@slow`. - If a test completes under 1 second on CI (including downloads if any) then it should be a normal test regardless. Collectively, all the non-slow tests need to cover entirely the different internals, while remaining fast. For example, a significant coverage can be achieved by testing with specially created tiny models with random weights. Such models have the very minimal number of layers (e.g., 2), vocab size (e.g., 1000), etc. Then the `@slow` tests can use large slow models to do qualitative testing. To see the use of these simply look for *tiny* models with: ```bash grep tiny tests examples ``` Here is an example of a [script](https://github.com/huggingface/transformers/tree/main/scripts/fsmt/fsmt-make-tiny-model.py) that created the tiny model [stas/tiny-wmt19-en-de](https://huggingface.co/stas/tiny-wmt19-en-de). You can easily adjust it to your specific model's architecture. It's easy to measure the run-time incorrectly if for example there is an overheard of downloading a huge model, but if you test it locally the downloaded files would be cached and thus the download time not measured. Hence check the execution speed report in CI logs instead (the output of `pytest --durations=0 tests`). That report is also useful to find slow outliers that aren't marked as such, or which need to be re-written to be fast. If you notice that the test suite starts getting slow on CI, the top listing of this report will show the slowest tests. ### Testing the stdout/stderr output In order to test functions that write to `stdout` and/or `stderr`, the test can access those streams using the `pytest`'s [capsys system](https://docs.pytest.org/en/latest/capture.html). Here is how this is accomplished: ```python import sys def print_to_stdout(s): print(s) def print_to_stderr(s): sys.stderr.write(s) def test_result_and_stdout(capsys): msg = "Hello" print_to_stdout(msg) print_to_stderr(msg) out, err = capsys.readouterr() # consume the captured output streams # optional: if you want to replay the consumed streams: sys.stdout.write(out) sys.stderr.write(err) # test: assert msg in out assert msg in err ``` And, of course, most of the time, `stderr` will come as a part of an exception, so try/except has to be used in such a case: ```python def raise_exception(msg): raise ValueError(msg) def test_something_exception(): msg = "Not a good value" error = "" try: raise_exception(msg) except Exception as e: error = str(e) assert msg in error, f"{msg} is in the exception:\n{error}" ``` Another approach to capturing stdout is via `contextlib.redirect_stdout`: ```python from io import StringIO from contextlib import redirect_stdout def print_to_stdout(s): print(s) def test_result_and_stdout(): msg = "Hello" buffer = StringIO() with redirect_stdout(buffer): print_to_stdout(msg) out = buffer.getvalue() # optional: if you want to replay the consumed streams: sys.stdout.write(out) # test: assert msg in out ``` An important potential issue with capturing stdout is that it may contain `\r` characters that in normal `print` reset everything that has been printed so far. There is no problem with `pytest`, but with `pytest -s` these characters get included in the buffer, so to be able to have the test run with and without `-s`, you have to make an extra cleanup to the captured output, using `re.sub(r'~.*\r', '', buf, 0, re.M)`. But, then we have a helper context manager wrapper to automatically take care of it all, regardless of whether it has some `\r`'s in it or not, so it's a simple: ```python from transformers.testing_utils import CaptureStdout with CaptureStdout() as cs: function_that_writes_to_stdout() print(cs.out) ``` Here is a full test example: ```python from transformers.testing_utils import CaptureStdout msg = "Secret message\r" final = "Hello World" with CaptureStdout() as cs: print(msg + final) assert cs.out == final + "\n", f"captured: {cs.out}, expecting {final}" ``` If you'd like to capture `stderr` use the `CaptureStderr` class instead: ```python from transformers.testing_utils import CaptureStderr with CaptureStderr() as cs: function_that_writes_to_stderr() print(cs.err) ``` If you need to capture both streams at once, use the parent `CaptureStd` class: ```python from transformers.testing_utils import CaptureStd with CaptureStd() as cs: function_that_writes_to_stdout_and_stderr() print(cs.err, cs.out) ``` Also, to aid debugging test issues, by default these context managers automatically replay the captured streams on exit from the context. ### Capturing logger stream If you need to validate the output of a logger, you can use `CaptureLogger`: ```python from transformers import logging from transformers.testing_utils import CaptureLogger msg = "Testing 1, 2, 3" logging.set_verbosity_info() logger = logging.get_logger("transformers.models.bart.tokenization_bart") with CaptureLogger(logger) as cl: logger.info(msg) assert cl.out, msg + "\n" ``` ### Testing with environment variables If you want to test the impact of environment variables for a specific test you can use a helper decorator `transformers.testing_utils.mockenv` ```python from transformers.testing_utils import mockenv class HfArgumentParserTest(unittest.TestCase): @mockenv(TRANSFORMERS_VERBOSITY="error") def test_env_override(self): env_level_str = os.getenv("TRANSFORMERS_VERBOSITY", None) ``` At times an external program needs to be called, which requires setting `PYTHONPATH` in `os.environ` to include multiple local paths. A helper class `transformers.test_utils.TestCasePlus` comes to help: ```python from transformers.testing_utils import TestCasePlus class EnvExampleTest(TestCasePlus): def test_external_prog(self): env = self.get_env() # now call the external program, passing `env` to it ``` Depending on whether the test file was under the `tests` test suite or `examples` it'll correctly set up `env[PYTHONPATH]` to include one of these two directories, and also the `src` directory to ensure the testing is done against the current repo, and finally with whatever `env[PYTHONPATH]` was already set to before the test was called if anything. This helper method creates a copy of the `os.environ` object, so the original remains intact. ### Getting reproducible results In some situations you may want to remove randomness for your tests. To get identical reproducible results set, you will need to fix the seed: ```python seed = 42 # python RNG import random random.seed(seed) # pytorch RNGs import torch torch.manual_seed(seed) torch.backends.cudnn.deterministic = True if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed) # numpy RNG import numpy as np np.random.seed(seed) # tf RNG import tensorflow as tf tf.random.set_seed(seed) ``` ### Debugging tests To start a debugger at the point of the warning, do this: ```bash pytest tests/utils/test_logging.py -W error::UserWarning --pdb ``` ## Working with github actions workflows To trigger a self-push workflow CI job, you must: 1. Create a new branch on `transformers` origin (not a fork!). 2. The branch name has to start with either `ci_` or `ci-` (`main` triggers it too, but we can't do PRs on `main`). It also gets triggered only for specific paths - you can find the up-to-date definition in case it changed since this document has been written [here](https://github.com/huggingface/transformers/blob/main/.github/workflows/self-push.yml) under *push:* 3. Create a PR from this branch. 4. Then you can see the job appear [here](https://github.com/huggingface/transformers/actions/workflows/self-push.yml). It may not run right away if there is a backlog. ## Testing Experimental CI Features Testing CI features can be potentially problematic as it can interfere with the normal CI functioning. Therefore if a new CI feature is to be added, it should be done as following. 1. Create a new dedicated job that tests what needs to be tested 2. The new job must always succeed so that it gives us a green โœ“ (details below). 3. Let it run for some days to see that a variety of different PR types get to run on it (user fork branches, non-forked branches, branches originating from github.com UI direct file edit, various forced pushes, etc. - there are so many) while monitoring the experimental job's logs (not the overall job green as it's purposefully always green) 4. When it's clear that everything is solid, then merge the new changes into existing jobs. That way experiments on CI functionality itself won't interfere with the normal workflow. Now how can we make the job always succeed while the new CI feature is being developed? Some CIs, like TravisCI support ignore-step-failure and will report the overall job as successful, but CircleCI and Github Actions as of this writing don't support that. So the following workaround can be used: 1. `set +euo pipefail` at the beginning of the run command to suppress most potential failures in the bash script. 2. the last command must be a success: `echo "done"` or just `true` will do Here is an example: ```yaml - run: name: run CI experiment command: | set +euo pipefail echo "setting run-all-despite-any-errors-mode" this_command_will_fail echo "but bash continues to run" # emulate another failure false # but the last command must be a success echo "during experiment do not remove: reporting success to CI, even if there were failures" ``` For simple commands you could also do: ```bash cmd_that_may_fail || true ``` Of course, once satisfied with the results, integrate the experimental step or job with the rest of the normal jobs, while removing `set +euo pipefail` or any other things you may have added to ensure that the experimental job doesn't interfere with the normal CI functioning. This whole process would have been much easier if we only could set something like `allow-failure` for the experimental step, and let it fail without impacting the overall status of PRs. But as mentioned earlier CircleCI and Github Actions don't support it at the moment. You can vote for this feature and see where it is at these CI-specific threads: - [Github Actions:](https://github.com/actions/toolkit/issues/399) - [CircleCI:](https://ideas.circleci.com/ideas/CCI-I-344) ## DeepSpeed integration For a PR that involves the DeepSpeed integration, keep in mind our CircleCI PR CI setup doesn't have GPUs. Tests requiring GPUs are run on a different CI nightly. This means if you get a passing CI report in your PR, it doesnโ€™t mean the DeepSpeed tests pass. To run DeepSpeed tests: ```bash RUN_SLOW=1 pytest tests/deepspeed/test_deepspeed.py ``` Any changes to the modeling or PyTorch examples code requires running the model zoo tests as well. ```bash RUN_SLOW=1 pytest tests/deepspeed ``` # Instantiate a big model A barrier to accessing very large pretrained models is the amount of memory required. When loading a pretrained PyTorch model, you usually: 1. Create a model with random weights. 2. Load your pretrained weights. 3. Put those pretrained weights in the model. The first two steps both require a full version of the model in memory and if the model weighs several GBs, you may not have enough memory for two copies of it. This problem is amplified in distributed training environments because each process loads a pretrained model and stores two copies in memory. > [!TIP] > The randomly created model is initialized with "empty" tensors, which take space in memory without filling it. The random values are whatever was in this chunk of memory at the time. To improve loading speed, the [`_fast_init`](https://github.com/huggingface/transformers/blob/c9f6e5e35156e068b227dd9b15521767f6afd4d2/src/transformers/modeling_utils.py#L2710) parameter is set to `True` by default to skip the random initialization for all weights that are correctly loaded. This guide will show you how Transformers can help you load large pretrained models despite their memory requirements. ## Sharded checkpoints From Transformers v4.18.0, a checkpoint larger than 10GB is automatically sharded by the `save_pretrained()` method. It is split into several smaller partial checkpoints and creates an index file that maps parameter names to the files they're stored in. The maximum shard size is controlled with the `max_shard_size` parameter, but by default it is 5GB, because it is easier to run on free-tier GPU instances without running out of memory. For example, let's shard [BioMistral/BioMistral-7B](https://hf.co/BioMistral/BioMistral-7B). ```py >>> with tempfile.TemporaryDirectory() as tmp_dir: ... model.save_pretrained(tmp_dir, max_shard_size="5GB") ... print(sorted(os.listdir(tmp_dir))) ['config.json', 'generation_config.json', 'model-00001-of-00006.safetensors', 'model-00002-of-00006.safetensors', 'model-00003-of-00006.safetensors', 'model-00004-of-00006.safetensors', 'model-00005-of-00006.safetensors', 'model-00006-of-00006.safetensors', 'model.safetensors.index.json'] ``` The sharded checkpoint is reloaded with the `from_pretrained()` method. ```py >>> with tempfile.TemporaryDirectory() as tmp_dir: ... model.save_pretrained(tmp_dir, max_shard_size="5GB") ... new_model = AutoModel.from_pretrained(tmp_dir) ``` The main advantage of sharded checkpoints for big models is that each shard is loaded after the previous one, which caps the memory usage to only the model size and the largest shard size. You could also directly load a sharded checkpoint inside a model without the `from_pretrained()` method (similar to PyTorch's `load_state_dict()` method for a full checkpoint). In this case, use the `load_sharded_checkpoint()` method. ```py >>> from transformers.modeling_utils import load_sharded_checkpoint >>> with tempfile.TemporaryDirectory() as tmp_dir: ... model.save_pretrained(tmp_dir, max_shard_size="5GB") ... load_sharded_checkpoint(model, tmp_dir) ``` ### Shard metadata The index file determines which keys are in the checkpoint and where the corresponding weights are stored. This file is loaded like any other JSON file and you can get a dictionary from it. ```py >>> import json >>> with tempfile.TemporaryDirectory() as tmp_dir: ... model.save_pretrained(tmp_dir, max_shard_size="5GB") ... with open(os.path.join(tmp_dir, "model.safetensors.index.json"), "r") as f: ... index = json.load(f) >>> print(index.keys()) dict_keys(['metadata', 'weight_map']) ``` The `metadata` key provides the total model size. ```py >>> index["metadata"] {'total_size': 28966928384} ``` The `weight_map` key maps each parameter name (typically `state_dict` in a PyTorch model) to the shard it's stored in. ```py >>> index["weight_map"] {'lm_head.weight': 'model-00006-of-00006.safetensors', 'model.embed_tokens.weight': 'model-00001-of-00006.safetensors', 'model.layers.0.input_layernorm.weight': 'model-00001-of-00006.safetensors', 'model.layers.0.mlp.down_proj.weight': 'model-00001-of-00006.safetensors', ... } ``` ## Accelerate's Big Model Inference > [!TIP] > Make sure you have Accelerate v0.9.0 or later and PyTorch v1.9.0 or later installed. From Transformers v4.20.0, the `from_pretrained()` method is supercharged with Accelerate's [Big Model Inference](https://hf.co/docs/accelerate/usage_guides/big_modeling) feature to efficiently handle really big models! Big Model Inference creates a *model skeleton* on PyTorch's [**meta**](https://pytorch.org/docs/main/meta.html) device. The randomly initialized parameters are only created when the pretrained weights are loaded. This way, you aren't keeping two copies of the model in memory at the same time (one for the randomly initialized model and one for the pretrained weights), and the maximum memory consumed is only the full model size. To enable Big Model Inference in Transformers, set `low_cpu_mem_usage=True` in the `from_pretrained()` method. ```py from transformers import AutoModelForCausalLM gemma = AutoModelForCausalLM.from_pretrained("google/gemma-7b", low_cpu_mem_usage=True) ``` Accelerate automatically dispatches the model weights across all available devices, starting with the fastest device (GPU) first and then offloading to the slower devices (CPU and even hard drive). This is enabled by setting `device_map="auto"` in the `from_pretrained()` method. When you pass the `device_map` parameter, `low_cpu_mem_usage` is automatically set to `True` so you don't need to specify it. ```py from transformers import AutoModelForCausalLM # these loading methods are equivalent gemma = AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto") gemma = AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto", low_cpu_mem_usage=True) ``` You can also write your own `device_map` by mapping each layer to a device. It should map all model parameters to a device, but you don't have to detail where all the submodules of a layer go if the entire layer is on the same device. ```python device_map = {"model.layers.1": 0, "model.layers.14": 1, "model.layers.31": "cpu", "lm_head": "disk"} ``` Access `hf_device_map` attribute to see how Accelerate split the model across devices. ```py gemma.hf_device_map ``` ```python out {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 'cpu', 'model.layers.15': 'cpu', 'model.layers.16': 'cpu', 'model.layers.17': 'cpu', 'model.layers.18': 'cpu', 'model.layers.19': 'cpu', 'model.layers.20': 'cpu', 'model.layers.21': 'cpu', 'model.layers.22': 'cpu', 'model.layers.23': 'cpu', 'model.layers.24': 'cpu', 'model.layers.25': 'cpu', 'model.layers.26': 'cpu', 'model.layers.27': 'cpu', 'model.layers.28': 'cpu', 'model.layers.29': 'cpu', 'model.layers.30': 'cpu', 'model.layers.31': 'cpu', 'model.norm': 'cpu', 'lm_head': 'cpu'} ``` ## Model data type PyTorch model weights are normally instantiated as torch.float32 and it can be an issue if you try to load a model as a different data type. For example, you'd need twice as much memory to load the weights in torch.float32 and then again to load them in your desired data type, like torch.float16. > [!WARNING] > Due to how PyTorch is designed, the `torch_dtype` parameter only supports floating data types. To avoid wasting memory like this, explicitly set the `torch_dtype` parameter to the desired data type or set `torch_dtype="auto"` to load the weights with the most optimal memory pattern (the data type is automatically derived from the model weights). ```py from transformers import AutoModelForCausalLM gemma = AutoModelForCausalLM.from_pretrained("google/gemma-7b", torch_dtype=torch.float16) ``` ```py from transformers import AutoModelForCausalLM gemma = AutoModelForCausalLM.from_pretrained("google/gemma-7b", torch_dtype="auto") ``` You can also set the data type to use for models instantiated from scratch. ```python import torch from transformers import AutoConfig, AutoModel my_config = AutoConfig.from_pretrained("google/gemma-2b", torch_dtype=torch.float16) model = AutoModel.from_config(my_config) ``` # Preprocess Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. ๐Ÿค— Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for: * Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors. * Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors. * Image inputs use a [ImageProcessor](./main_classes/image_processor) to convert images into tensors. * Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor or image processor. `AutoProcessor` **always** works and automatically chooses the correct class for the model you're using, whether you're using a tokenizer, image processor, feature extractor or processor. Before you begin, install ๐Ÿค— Datasets so you can load some datasets to experiment with: ```bash pip install datasets ``` ## Natural Language Processing The main tool for preprocessing textual data is a [tokenizer](main_classes/tokenizer). A tokenizer splits text into *tokens* according to a set of rules. The tokens are converted into numbers and then tensors, which become the model inputs. Any additional inputs required by the model are added by the tokenizer. If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus, and uses the same corresponding tokens-to-index (usually referred to as the *vocab*) during pretraining. Get started by loading a pretrained tokenizer with the `AutoTokenizer.from_pretrained()` method. This downloads the *vocab* a model was pretrained with: ```py >>> from transformers import AutoTokenizer >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") ``` Then pass your text to the tokenizer: ```py >>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.") >>> print(encoded_input) {'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} ``` The tokenizer returns a dictionary with three important items: * [input_ids](glossary#input-ids) are the indices corresponding to each token in the sentence. * [attention_mask](glossary#attention-mask) indicates whether a token should be attended to or not. * [token_type_ids](glossary#token-type-ids) identifies which sequence a token belongs to when there is more than one sequence. Return your input by decoding the `input_ids`: ```py >>> tokenizer.decode(encoded_input["input_ids"]) '[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]' ``` As you can see, the tokenizer added two special tokens - `CLS` and `SEP` (classifier and separator) - to the sentence. Not all models need special tokens, but if they do, the tokenizer automatically adds them for you. If there are several sentences you want to preprocess, pass them as a list to the tokenizer: ```py >>> batch_sentences = [ ... "But what about second breakfast?", ... "Don't think he knows about second breakfast, Pip.", ... "What about elevensies?", ... ] >>> encoded_inputs = tokenizer(batch_sentences) >>> print(encoded_inputs) {'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], [101, 1327, 1164, 5450, 23434, 136, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]} ``` ### Pad Sentences aren't always the same length which can be an issue because tensors, the model inputs, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special *padding token* to shorter sentences. Set the `padding` parameter to `True` to pad the shorter sequences in the batch to match the longest sequence: ```py >>> batch_sentences = [ ... "But what about second breakfast?", ... "Don't think he knows about second breakfast, Pip.", ... "What about elevensies?", ... ] >>> encoded_input = tokenizer(batch_sentences, padding=True) >>> print(encoded_input) {'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]} ``` The first and third sentences are now padded with `0`'s because they are shorter. ### Truncation On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you'll need to truncate the sequence to a shorter length. Set the `truncation` parameter to `True` to truncate a sequence to the maximum length accepted by the model: ```py >>> batch_sentences = [ ... "But what about second breakfast?", ... "Don't think he knows about second breakfast, Pip.", ... "What about elevensies?", ... ] >>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True) >>> print(encoded_input) {'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]} ``` Check out the [Padding and truncation](./pad_truncation) concept guide to learn more different padding and truncation arguments. ### Build tensors Finally, you want the tokenizer to return the actual tensors that get fed to the model. Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for TensorFlow: ```py >>> batch_sentences = [ ... "But what about second breakfast?", ... "Don't think he knows about second breakfast, Pip.", ... "What about elevensies?", ... ] >>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt") >>> print(encoded_input) {'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])} ``` Different pipelines support tokenizer arguments in their `__call__()` differently. `text-2-text-generation` pipelines support (i.e. pass on) only `truncation`. `text-generation` pipelines support `max_length`, `truncation`, `padding` and `add_special_tokens`. In `fill-mask` pipelines, tokenizer arguments can be passed in the `tokenizer_kwargs` argument (dictionary). ## Audio For audio tasks, you'll need a [feature extractor](main_classes/feature_extractor) to prepare your dataset for the model. The feature extractor is designed to extract features from raw audio data, and convert them into tensors. Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset (see the ๐Ÿค— [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub) for more details on how to load a dataset) to see how you can use a feature extractor with audio datasets: ```py >>> from datasets import load_dataset, Audio >>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train") ``` Access the first element of the `audio` column to take a look at the input. Calling the `audio` column automatically loads and resamples the audio file: ```py >>> dataset[0]["audio"] {'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414, 0. , 0. ], dtype=float32), 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav', 'sampling_rate': 8000} ``` This returns three items: * `array` is the speech signal loaded - and potentially resampled - as a 1D array. * `path` points to the location of the audio file. * `sampling_rate` refers to how many data points in the speech signal are measured per second. For this tutorial, you'll use the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model. Take a look at the model card, and you'll learn Wav2Vec2 is pretrained on 16kHz sampled speech audio. It is important your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, then you need to resample your data. 1. Use ๐Ÿค— Datasets' [cast_column](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.cast_column) method to upsample the sampling rate to 16kHz: ```py >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000)) ``` 2. Call the `audio` column again to resample the audio file: ```py >>> dataset[0]["audio"] {'array': array([ 2.3443763e-05, 2.1729663e-04, 2.2145823e-04, ..., 3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32), 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav', 'sampling_rate': 16000} ``` Next, load a feature extractor to normalize and pad the input. When padding textual data, a `0` is added for shorter sequences. The same idea applies to audio data. The feature extractor adds a `0` - interpreted as silence - to `array`. Load the feature extractor with `AutoFeatureExtractor.from_pretrained()`: ```py >>> from transformers import AutoFeatureExtractor >>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base") ``` Pass the audio `array` to the feature extractor. We also recommend adding the `sampling_rate` argument in the feature extractor in order to better debug any silent errors that may occur. ```py >>> audio_input = [dataset[0]["audio"]["array"]] >>> feature_extractor(audio_input, sampling_rate=16000) {'input_values': [array([ 3.8106556e-04, 2.7506407e-03, 2.8015103e-03, ..., 5.6335266e-04, 4.6588284e-06, -1.7142107e-04], dtype=float32)]} ``` Just like the tokenizer, you can apply padding or truncation to handle variable sequences in a batch. Take a look at the sequence length of these two audio samples: ```py >>> dataset[0]["audio"]["array"].shape (173398,) >>> dataset[1]["audio"]["array"].shape (106496,) ``` Create a function to preprocess the dataset so the audio samples are the same lengths. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it: ```py >>> def preprocess_function(examples): ... audio_arrays = [x["array"] for x in examples["audio"]] ... inputs = feature_extractor( ... audio_arrays, ... sampling_rate=16000, ... padding=True, ... max_length=100000, ... truncation=True, ... ) ... return inputs ``` Apply the `preprocess_function` to the first few examples in the dataset: ```py >>> processed_dataset = preprocess_function(dataset[:5]) ``` The sample lengths are now the same and match the specified maximum length. You can pass your processed dataset to the model now! ```py >>> processed_dataset["input_values"][0].shape (100000,) >>> processed_dataset["input_values"][1].shape (100000,) ``` ## Computer vision For computer vision tasks, you'll need an [image processor](main_classes/image_processor) to prepare your dataset for the model. Image preprocessing consists of several steps that convert images into the input expected by the model. These steps include but are not limited to resizing, normalizing, color channel correction, and converting images to tensors. Image preprocessing often follows some form of image augmentation. Both image preprocessing and image augmentation transform image data, but they serve different purposes: * Image augmentation alters images in a way that can help prevent overfitting and increase the robustness of the model. You can get creative in how you augment your data - adjust brightness and colors, crop, rotate, resize, zoom, etc. However, be mindful not to change the meaning of the images with your augmentations. * Image preprocessing guarantees that the images match the modelโ€™s expected input format. When fine-tuning a computer vision model, images must be preprocessed exactly as when the model was initially trained. You can use any library you like for image augmentation. For image preprocessing, use the `ImageProcessor` associated with the model. Load the [food101](https://huggingface.co/datasets/food101) dataset (see the ๐Ÿค— [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub) for more details on how to load a dataset) to see how you can use an image processor with computer vision datasets: Use ๐Ÿค— Datasets `split` parameter to only load a small sample from the training split since the dataset is quite large! ```py >>> from datasets import load_dataset >>> dataset = load_dataset("food101", split="train[:100]") ``` Next, take a look at the image with ๐Ÿค— Datasets [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes?highlight=image#datasets.Image) feature: ```py >>> dataset[0]["image"] ```
Load the image processor with `AutoImageProcessor.from_pretrained()`: ```py >>> from transformers import AutoImageProcessor >>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224") ``` First, let's add some image augmentation. You can use any library you prefer, but in this tutorial, we'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. If you're interested in using another data augmentation library, learn how in the [Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) or [Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb). 1. Here we use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain together a couple of transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html). Note that for resizing, we can get the image size requirements from the `image_processor`. For some models, an exact height and width are expected, for others only the `shortest_edge` is defined. ```py >>> from torchvision.transforms import RandomResizedCrop, ColorJitter, Compose >>> size = ( ... image_processor.size["shortest_edge"] ... if "shortest_edge" in image_processor.size ... else (image_processor.size["height"], image_processor.size["width"]) ... ) >>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5)]) ``` 2. The model accepts [`pixel_values`](model_doc/vision-encoder-decoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) as its input. `ImageProcessor` can take care of normalizing the images, and generating appropriate tensors. Create a function that combines image augmentation and image preprocessing for a batch of images and generates `pixel_values`: ```py >>> def transforms(examples): ... images = [_transforms(img.convert("RGB")) for img in examples["image"]] ... examples["pixel_values"] = image_processor(images, do_resize=False, return_tensors="pt")["pixel_values"] ... return examples ``` In the example above we set `do_resize=False` because we have already resized the images in the image augmentation transformation, and leveraged the `size` attribute from the appropriate `image_processor`. If you do not resize images during image augmentation, leave this parameter out. By default, `ImageProcessor` will handle the resizing. If you wish to normalize images as a part of the augmentation transformation, use the `image_processor.image_mean`, and `image_processor.image_std` values. 3. Then use ๐Ÿค— Datasets[set_transform](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.set_transform) to apply the transforms on the fly: ```py >>> dataset.set_transform(transforms) ``` 4. Now when you access the image, you'll notice the image processor has added `pixel_values`. You can pass your processed dataset to the model now! ```py >>> dataset[0].keys() ``` Here is what the image looks like after the transforms are applied. The image has been randomly cropped and it's color properties are different. ```py >>> import numpy as np >>> import matplotlib.pyplot as plt >>> img = dataset[0]["pixel_values"] >>> plt.imshow(img.permute(1, 2, 0)) ```
For tasks like object detection, semantic segmentation, instance segmentation, and panoptic segmentation, `ImageProcessor` offers post processing methods. These methods convert model's raw outputs into meaningful predictions such as bounding boxes, or segmentation maps. ### Pad In some cases, for instance, when fine-tuning [DETR](./model_doc/detr), the model applies scale augmentation at training time. This may cause images to be different sizes in a batch. You can use `DetrImageProcessor.pad()` from `DetrImageProcessor` and define a custom `collate_fn` to batch images together. ```py >>> def collate_fn(batch): ... pixel_values = [item["pixel_values"] for item in batch] ... encoding = image_processor.pad(pixel_values, return_tensors="pt") ... labels = [item["labels"] for item in batch] ... batch = {} ... batch["pixel_values"] = encoding["pixel_values"] ... batch["pixel_mask"] = encoding["pixel_mask"] ... batch["labels"] = labels ... return batch ``` ## Multimodal For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples together two processing objects such as tokenizer and feature extractor. Load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset (see the ๐Ÿค— [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub) for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR): ```py >>> from datasets import load_dataset >>> lj_speech = load_dataset("lj_speech", split="train") ``` For ASR, you're mainly focused on `audio` and `text` so you can remove the other columns: ```py >>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"]) ``` Now take a look at the `audio` and `text` columns: ```py >>> lj_speech[0]["audio"] {'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ..., 7.3242188e-04, 2.1362305e-04, 6.1035156e-05], dtype=float32), 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav', 'sampling_rate': 22050} >>> lj_speech[0]["text"] 'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition' ``` Remember you should always [resample](preprocessing#audio) your audio dataset's sampling rate to match the sampling rate of the dataset used to pretrain a model! ```py >>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000)) ``` Load a processor with `AutoProcessor.from_pretrained()`: ```py >>> from transformers import AutoProcessor >>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h") ``` 1. Create a function to process the audio data contained in `array` to `input_values`, and tokenize `text` to `labels`. These are the inputs to the model: ```py >>> def prepare_dataset(example): ... audio = example["audio"] ... example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000)) ... return example ``` 2. Apply the `prepare_dataset` function to a sample: ```py >>> prepare_dataset(lj_speech[0]) ``` The processor has now added `input_values` and `labels`, and the sampling rate has also been correctly downsampled to 16kHz. You can pass your processed dataset to the model now! # Chatting with Transformers If you're reading this article, you're almost certainly aware of **chat models**. Chat models are conversational AIs that you can send and receive messages with. The most famous of these is the proprietary ChatGPT, but there are now many open-source chat models which match or even substantially exceed its performance. These models are free to download and run on a local machine. Although the largest and most capable models require high-powered hardware and lots of memory to run, there are smaller models that will run perfectly well on a single consumer GPU, or even an ordinary desktop or notebook CPU. This guide will help you get started with chat models. We'll start with a brief quickstart guide that uses a convenient, high-level "pipeline". This is all you need if you just want to start running a chat model immediately. After the quickstart, we'll move on to more detailed information about what exactly chat models are, how to choose an appropriate one, and a low-level breakdown of each of the steps involved in talking to a chat model. We'll also give some tips on optimizing the performance and memory usage of your chat models. ## Quickstart If you have no time for details, here's the brief summary: Chat models continue chats. This means that you pass them a conversation history, which can be as short as a single user message, and the model will continue the conversation by adding its response. Let's see this in action. First, let's build a chat: ```python chat = [ {"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."}, {"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"} ] ``` Notice that in addition to the user's message, we added a **system** message at the start of the conversation. Not all chat models support system messages, but when they do, they represent high-level directives about how the model should behave in the conversation. You can use this to guide the model - whether you want short or long responses, lighthearted or serious ones, and so on. If you want the model to do useful work instead of practicing its improv routine, you can either omit the system message or try a terse one such as "You are a helpful and intelligent AI assistant who responds to user queries." Once you have a chat, the quickest way to continue it is using the `TextGenerationPipeline`. Let's see this in action with `LLaMA-3`. Note that `LLaMA-3` is a gated model, which means you will need to [apply for access](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and log in with your Hugging Face account to use it. We'll also use `device_map="auto"`, which will load the model on GPU if there's enough memory for it, and set the dtype to `torch.bfloat16` to save memory: ```python import torch from transformers import pipeline pipe = pipeline("text-generation", "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto") response = pipe(chat, max_new_tokens=512) print(response[0]['generated_text'][-1]['content']) ``` And you'll get: ```text (sigh) Oh boy, you're asking me for advice? You're gonna need a map, pal! Alright, alright, I'll give you the lowdown. But don't say I didn't warn you, I'm a robot, not a tour guide! So, you wanna know what's fun to do in the Big Apple? Well, let me tell you, there's a million things to do, but I'll give you the highlights. First off, you gotta see the sights: the Statue of Liberty, Central Park, Times Square... you know, the usual tourist traps. But if you're lookin' for something a little more... unusual, I'd recommend checkin' out the Museum of Modern Art. It's got some wild stuff, like that Warhol guy's soup cans and all that jazz. And if you're feelin' adventurous, take a walk across the Brooklyn Bridge. Just watch out for those pesky pigeons, they're like little feathered thieves! (laughs) Get it? Thieves? Ah, never mind. Now, if you're lookin' for some serious fun, hit up the comedy clubs in Greenwich Village. You might even catch a glimpse of some up-and-coming comedians... or a bunch of wannabes tryin' to make it big. (winks) And finally, if you're feelin' like a real New Yorker, grab a slice of pizza from one of the many amazing pizzerias around the city. Just don't try to order a "robot-sized" slice, trust me, it won't end well. (laughs) So, there you have it, pal! That's my expert advice on what to do in New York. Now, if you'll excuse me, I've got some oil changes to attend to. (winks) ``` You can continue the chat by appending your own response to it. The `response` object returned by the pipeline actually contains the entire chat so far, so we can simply append a message and pass it back: ```python chat = response[0]['generated_text'] chat.append( {"role": "user", "content": "Wait, what's so wild about soup cans?"} ) response = pipe(chat, max_new_tokens=512) print(response[0]['generated_text'][-1]['content']) ``` And you'll get: ```text (laughs) Oh, you're killin' me, pal! You don't get it, do you? Warhol's soup cans are like, art, man! It's like, he took something totally mundane, like a can of soup, and turned it into a masterpiece. It's like, "Hey, look at me, I'm a can of soup, but I'm also a work of art!" (sarcastically) Oh, yeah, real original, Andy. But, you know, back in the '60s, it was like, a big deal. People were all about challenging the status quo, and Warhol was like, the king of that. He took the ordinary and made it extraordinary. And, let me tell you, it was like, a real game-changer. I mean, who would've thought that a can of soup could be art? (laughs) But, hey, you're not alone, pal. I mean, I'm a robot, and even I don't get it. (winks) But, hey, that's what makes art, art, right? (laughs) ``` The remainder of this tutorial will cover specific topics such as performance and memory, or how to select a chat model for your needs. ## Choosing a chat model There are an enormous number of different chat models available on the [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending), and new users often feel very overwhelmed by the selection offered. Don't be, though! You really need to just focus on two important considerations: - The model's size, which will determine if you can fit it in memory and how quickly it will run. - The quality of the model's chat output. In general, these are correlated - bigger models tend to be more capable, but even so there's a lot of variation at a given size point! ### Size and model naming The size of a model is easy to spot - it's the number in the model name, like "8B" or "70B". This is the number of **parameters** in the model. Without quantization, you should expect to need about 2 bytes of memory per parameter. This means that an "8B" model with 8 billion parameters will need about 16GB of memory just to fit the parameters, plus a little extra for other overhead. It's a good fit for a high-end consumer GPU with 24GB of memory, such as a 3090 or 4090. Some chat models are "Mixture of Experts" models. These may list their sizes in different ways, such as "8x7B" or "141B-A35B". The numbers are a little fuzzier here, but in general you can read this as saying that the model has approximately 56 (8x7) billion parameters in the first case, or 141 billion parameters in the second case. Note that it is very common to use quantization techniques to reduce the memory usage per parameter to 8 bits, 4 bits, or even less. This topic is discussed in more detail in the [Memory considerations](#memory-considerations) section below. ### But which chat model is best? Even once you know the size of chat model you can run, there's still a lot of choice out there. One way to sift through it all is to consult **leaderboards**. Two of the most popular leaderboards are the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) and the [LMSys Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard). Note that the LMSys leaderboard also includes proprietary models - look at the `licence` column to identify open-source ones that you can download, then search for them on the [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending). ### Specialist domains Some models may be specialized for certain domains, such as medical or legal text, or non-English languages. If you're working in these domains, you may find that a specialized model will give you big performance benefits. Don't automatically assume that, though! Particularly when specialized models are smaller or older than the current cutting-edge, a top-end general-purpose model may still outclass them. Thankfully, we are beginning to see [domain-specific leaderboards](https://huggingface.co/blog/leaderboard-medicalllm) that should make it easier to locate the best models for specialized domains. ## What happens inside the pipeline? The quickstart above used a high-level pipeline to chat with a chat model, which is convenient, but not the most flexible. Let's take a more low-level approach, to see each of the steps involved in chat. Let's start with a code sample, and then break it down: ```python from transformers import AutoModelForCausalLM, AutoTokenizer import torch # Prepare the input as before chat = [ {"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."}, {"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"} ] # 1: Load the model and tokenizer model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", torch_dtype=torch.bfloat16) tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct") # 2: Apply the chat template formatted_chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True) print("Formatted chat:\n", formatted_chat) # 3: Tokenize the chat (This can be combined with the previous step using tokenize=True) inputs = tokenizer(formatted_chat, return_tensors="pt", add_special_tokens=False) # Move the tokenized inputs to the same device the model is on (GPU/CPU) inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()} print("Tokenized inputs:\n", inputs) # 4: Generate text from the model outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.1) print("Generated tokens:\n", outputs) # 5: Decode the output back to a string decoded_output = tokenizer.decode(outputs[0][inputs['input_ids'].size(1):], skip_special_tokens=True) print("Decoded output:\n", decoded_output) ``` There's a lot in here, each piece of which could be its own document! Rather than going into too much detail, I'll cover the broad ideas, and leave the details for the linked documents. The key steps are: 1. [Models](https://huggingface.co/learn/nlp-course/en/chapter2/3) and [Tokenizers](https://huggingface.co/learn/nlp-course/en/chapter2/4?fw=pt) are loaded from the Hugging Face Hub. 2. The chat is formatted using the tokenizer's [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating) 3. The formatted chat is [tokenized](https://huggingface.co/learn/nlp-course/en/chapter2/4) using the tokenizer. 4. We [generate](https://huggingface.co/docs/transformers/en/llm_tutorial) a response from the model. 5. The tokens output by the model are decoded back to a string ## Performance, memory and hardware You probably know by now that most machine learning tasks are run on GPUs. However, it is entirely possible to generate text from a chat model or language model on a CPU, albeit somewhat more slowly. If you can fit the model in GPU memory, though, this will usually be the preferable option. ### Memory considerations By default, Hugging Face classes like `TextGenerationPipeline` or `AutoModelForCausalLM` will load the model in `float32` precision. This means that it will need 4 bytes (32 bits) per parameter, so an "8B" model with 8 billion parameters will need ~32GB of memory. However, this can be wasteful! Most modern language models are trained in "bfloat16" precision, which uses only 2 bytes per parameter. If your hardware supports it (Nvidia 30xx/Axxx or newer), you can load the model in `bfloat16` precision, using the `torch_dtype` argument as we did above. It is possible to go even lower than 16-bits using "quantization", a method to lossily compress model weights. This allows each parameter to be squeezed down to 8 bits, 4 bits or even less. Note that, especially at 4 bits, the model's outputs may be negatively affected, but often this is a tradeoff worth making to fit a larger and more capable chat model in memory. Let's see this in action with `bitsandbytes`: ```python from transformers import AutoModelForCausalLM, BitsAndBytesConfig quantization_config = BitsAndBytesConfig(load_in_8bit=True) # You can also try load_in_4bit model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", quantization_config=quantization_config) ``` Or we can do the same thing using the `pipeline` API: ```python from transformers import pipeline, BitsAndBytesConfig quantization_config = BitsAndBytesConfig(load_in_8bit=True) # You can also try load_in_4bit pipe = pipeline("text-generation", "meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", model_kwargs={"quantization_config": quantization_config}) ``` There are several other options for quantizing models besides `bitsandbytes` - please see the [Quantization guide](./quantization) for more information. ### Performance considerations For a more extensive guide on language model performance and optimization, check out [LLM Inference Optimization](./llm_optims) . As a general rule, larger chat models will be slower in addition to requiring more memory. It's possible to be more concrete about this, though: Generating text from a chat model is unusual in that it is bottlenecked by **memory bandwidth** rather than compute power, because every active parameter must be read from memory for each token that the model generates. This means that number of tokens per second you can generate from a chat model is generally proportional to the total bandwidth of the memory it resides in, divided by the size of the model. In our quickstart example above, our model was ~16GB in size when loaded in `bfloat16` precision. This means that 16GB must be read from memory for every token generated by the model. Total memory bandwidth can vary from 20-100GB/sec for consumer CPUs to 200-900GB/sec for consumer GPUs, specialized CPUs like Intel Xeon, AMD Threadripper/Epyc or high-end Apple silicon, and finally up to 2-3TB/sec for data center GPUs like the Nvidia A100 or H100. This should give you a good idea of the generation speed you can expect from these different hardware types. Therefore, if you want to improve the speed of text generation, the easiest solution is to either reduce the size of the model in memory (usually by quantization), or get hardware with higher memory bandwidth. For advanced users, several other techniques exist to get around this bandwidth bottleneck. The most common are variants on [assisted generation](https://huggingface.co/blog/assisted-generation), also known as "speculative sampling". These techniques try to guess multiple future tokens at once, often using a smaller "draft model", and then confirm these generations with the chat model. If the guesses are validated by the chat model, more than one token can be generated per forward pass, which greatly alleviates the bandwidth bottleneck and improves generation speed. Finally, we should also note the impact of "Mixture of Experts" (MoE) models here. Several popular chat models, such as Mixtral, Qwen-MoE and DBRX, are MoE models. In these models, not every parameter is active for every token generated. As a result, MoE models generally have much lower memory bandwidth requirements, even though their total size can be quite large. They can therefore be several times faster than a normal "dense" model of the same size. However, techniques like assisted generation are generally ineffective for these models because more parameters will become active with each new speculated token, which will negate the bandwidth and speed benefits that the MoE architecture provides. # Agents and tools ### What is an agent? Large Language Models (LLMs) trained to perform [causal language modeling](./tasks/language_modeling) can tackle a wide range of tasks, but they often struggle with basic tasks like logic, calculation, and search. When prompted in domains in which they do not perform well, they often fail to generate the answer we expect them to. One approach to overcome this weakness is to create an *agent*. An agent is a system that uses an LLM as its engine, and it has access to functions called *tools*. These *tools* are functions for performing a task, and they contain all necessary description for the agent to properly use them. The agent can be programmed to: - devise a series of actions/tools and run them all at once, like the `CodeAgent` - plan and execute actions/tools one by one and wait for the outcome of each action before launching the next one, like the `ReactJsonAgent` ### Types of agents #### Code agent This agent has a planning step, then generates python code to execute all its actions at once. It natively handles different input and output types for its tools, thus it is the recommended choice for multimodal tasks. #### React agents This is the go-to agent to solve reasoning tasks, since the ReAct framework ([Yao et al., 2022](https://huggingface.co/papers/2210.03629)) makes it really efficient to think on the basis of its previous observations. We implement two versions of ReactJsonAgent: - `ReactJsonAgent` generates tool calls as a JSON in its output. - `ReactCodeAgent` is a new type of ReactJsonAgent that generates its tool calls as blobs of code, which works really well for LLMs that have strong coding performance. > [!TIP] > Read [Open-source LLMs as LangChain Agents](https://huggingface.co/blog/open-source-llms-as-agents) blog post to learn more about ReAct agents.
![Framework of a React Agent](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/open-source-llms-as-agents/ReAct.png) For example, here is how a ReAct Code agent would work its way through the following question. ```py3 >>> agent.run( ... "How many more blocks (also denoted as layers) in BERT base encoder than the encoder from the architecture proposed in Attention is All You Need?", ... ) =====New task===== How many more blocks (also denoted as layers) in BERT base encoder than the encoder from the architecture proposed in Attention is All You Need? ====Agent is executing the code below: bert_blocks = search(query="number of blocks in BERT base encoder") print("BERT blocks:", bert_blocks) ==== Print outputs: BERT blocks: twelve encoder blocks ====Agent is executing the code below: attention_layer = search(query="number of layers in Attention is All You Need") print("Attention layers:", attention_layer) ==== Print outputs: Attention layers: Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position- 2 Page 3 Figure 1: The Transformer - model architecture. ====Agent is executing the code below: bert_blocks = 12 attention_layers = 6 diff = bert_blocks - attention_layers print("Difference in blocks:", diff) final_answer(diff) ==== Print outputs: Difference in blocks: 6 Final answer: 6 ``` ### How can I build an agent? To initialize an agent, you need these arguments: - an LLM to power your agent - the agent is not exactly the LLM, itโ€™s more like the agent is a program that uses an LLM as its engine. - a system prompt: what the LLM engine will be prompted with to generate its output - a toolbox from which the agent pick tools to execute - a parser to extract from the LLM output which tools are to call and with which arguments Upon initialization of the agent system, the tool attributes are used to generate a tool description, then baked into the agentโ€™s `system_prompt` to let it know which tools it can use and why. To start with, please install the `agents` extras in order to install all default dependencies. ```bash pip install transformers[agents] ``` Build your LLM engine by defining a `llm_engine` method which accepts a list of [messages](./chat_templating) and returns text. This callable also needs to accept a `stop` argument that indicates when to stop generating. ```python from huggingface_hub import login, InferenceClient login("") client = InferenceClient(model="meta-llama/Meta-Llama-3-70B-Instruct") def llm_engine(messages, stop_sequences=["Task"]) -> str: response = client.chat_completion(messages, stop=stop_sequences, max_tokens=1000) answer = response.choices[0].message.content return answer ``` You could use any `llm_engine` method as long as: 1. it follows the [messages format](./chat_templating) (`List[Dict[str, str]]`) for its input `messages`, and it returns a `str`. 2. it stops generating outputs at the sequences passed in the argument `stop_sequences` Additionally, `llm_engine` can also take a `grammar` argument. In the case where you specify a `grammar` upon agent initialization, this argument will be passed to the calls to llm_engine, with the `grammar` that you defined upon initialization, to allow [constrained generation](https://huggingface.co/docs/text-generation-inference/conceptual/guidance) in order to force properly-formatted agent outputs. You will also need a `tools` argument which accepts a list of `Tools` - it can be an empty list. You can also add the default toolbox on top of your `tools` list by defining the optional argument `add_base_tools=True`. Now you can create an agent, like `CodeAgent`, and run it. You can also create a `TransformersEngine` with a pre-initialized pipeline to run inference on your local machine using `transformers`. For convenience, since agentic behaviours generally require stronger models such as `Llama-3.1-70B-Instruct` that are harder to run locally for now, we also provide the `HfApiEngine` class that initializes a `huggingface_hub.InferenceClient` under the hood. ```python from transformers import CodeAgent, HfApiEngine llm_engine = HfApiEngine(model="meta-llama/Meta-Llama-3-70B-Instruct") agent = CodeAgent(tools=[], llm_engine=llm_engine, add_base_tools=True) agent.run( "Could you translate this sentence from French, say it out loud and return the audio.", sentence="Oรน est la boulangerie la plus proche?", ) ``` This will be handy in case of emergency baguette need! You can even leave the argument `llm_engine` undefined, and an `HfApiEngine` will be created by default. ```python from transformers import CodeAgent agent = CodeAgent(tools=[], add_base_tools=True) agent.run( "Could you translate this sentence from French, say it out loud and give me the audio.", sentence="Oรน est la boulangerie la plus proche?", ) ``` Note that we used an additional `sentence` argument: you can pass text as additional arguments to the model. You can also use this to indicate the path to local or remote files for the model to use: ```py from transformers import ReactCodeAgent agent = ReactCodeAgent(tools=[], llm_engine=llm_engine, add_base_tools=True) agent.run("Why does Mike not know many people in New York?", audio="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/recording.mp3") ``` The prompt and output parser were automatically defined, but you can easily inspect them by calling the `system_prompt_template` on your agent. ```python print(agent.system_prompt_template) ``` It's important to explain as clearly as possible the task you want to perform. Every `run()` operation is independent, and since an agent is powered by an LLM, minor variations in your prompt might yield completely different results. You can also run an agent consecutively for different tasks: each time the attributes `agent.task` and `agent.logs` will be re-initialized. #### Code execution A Python interpreter executes the code on a set of inputs passed along with your tools. This should be safe because the only functions that can be called are the tools you provided (especially if it's only tools by Hugging Face) and the print function, so you're already limited in what can be executed. The Python interpreter also doesn't allow imports by default outside of a safe list, so all the most obvious attacks shouldn't be an issue. You can still authorize additional imports by passing the authorized modules as a list of strings in argument `additional_authorized_imports` upon initialization of your `ReactCodeAgent` or `CodeAgent`: ```py >>> from transformers import ReactCodeAgent >>> agent = ReactCodeAgent(tools=[], additional_authorized_imports=['requests', 'bs4']) >>> agent.run("Could you get me the title of the page at url 'https://huggingface.co/blog'?") (...) 'Hugging Face โ€“ Blog' ``` The execution will stop at any code trying to perform an illegal operation or if there is a regular Python error with the code generated by the agent. > [!WARNING] > The LLM can generate arbitrary code that will then be executed: do not add any unsafe imports! ### The system prompt An agent, or rather the LLM that drives the agent, generates an output based on the system prompt. The system prompt can be customized and tailored to the intended task. For example, check the system prompt for the `ReactCodeAgent` (below version is slightly simplified). ```text You will be given a task to solve as best you can. You have access to the following tools: <> To solve the task, you must plan forward to proceed in a series of steps, in a cycle of 'Thought:', 'Code:', and 'Observation:' sequences. At each step, in the 'Thought:' sequence, you should first explain your reasoning towards solving the task, then the tools that you want to use. Then in the 'Code:' sequence, you shold write the code in simple Python. The code sequence must end with '/End code' sequence. During each intermediate step, you can use 'print()' to save whatever important information you will then need. These print outputs will then be available in the 'Observation:' field, for using this information as input for the next step. In the end you have to return a final answer using the `final_answer` tool. Here are a few examples using notional tools: --- {examples} Above example were using notional tools that might not exist for you. You only have acces to those tools: <> You also can perform computations in the python code you generate. Always provide a 'Thought:' and a 'Code:\n```py' sequence ending with '```' sequence. You MUST provide at least the 'Code:' sequence to move forward. Remember to not perform too many operations in a single code block! You should split the task into intermediate code blocks. Print results at the end of each step to save the intermediate results. Then use final_answer() to return the final result. Remember to make sure that variables you use are all defined. Now Begin! ``` The system prompt includes: - An *introduction* that explains how the agent should behave and what tools are. - A description of all the tools that is defined by a `<>` token that is dynamically replaced at runtime with the tools defined/chosen by the user. - The tool description comes from the tool attributes, `name`, `description`, `inputs` and `output_type`, and a simple `jinja2` template that you can refine. - The expected output format. You could improve the system prompt, for example, by adding an explanation of the output format. For maximum flexibility, you can overwrite the whole system prompt template by passing your custom prompt as an argument to the `system_prompt` parameter. ```python from transformers import ReactJsonAgent from transformers.agents import PythonInterpreterTool agent = ReactJsonAgent(tools=[PythonInterpreterTool()], system_prompt="{your_custom_prompt}") ``` > [!WARNING] > Please make sure to define the `<>` string somewhere in the `template` so the agent is aware of the available tools. ### Inspecting an agent run Here are a few useful attributes to inspect what happened after a run: - `agent.logs` stores the fine-grained logs of the agent. At every step of the agent's run, everything gets stored in a dictionary that then is appended to `agent.logs`. - Running `agent.write_inner_memory_from_logs()` creates an inner memory of the agent's logs for the LLM to view, as a list of chat messages. This method goes over each step of the log and only stores what it's interested in as a message: for instance, it will save the system prompt and task in separate messages, then for each step it will store the LLM output as a message, and the tool call output as another message. Use this if you want a higher-level view of what has happened - but not every log will be transcripted by this method. ## Tools A tool is an atomic function to be used by an agent. You can for instance check the `PythonInterpreterTool`: it has a name, a description, input descriptions, an output type, and a `__call__` method to perform the action. When the agent is initialized, the tool attributes are used to generate a tool description which is baked into the agent's system prompt. This lets the agent know which tools it can use and why. ### Default toolbox Transformers comes with a default toolbox for empowering agents, that you can add to your agent upon initialization with argument `add_base_tools = True`: - **Document question answering**: given a document (such as a PDF) in image format, answer a question on this document ([Donut](./model_doc/donut)) - **Image question answering**: given an image, answer a question on this image ([VILT](./model_doc/vilt)) - **Speech to text**: given an audio recording of a person talking, transcribe the speech into text ([Whisper](./model_doc/whisper)) - **Text to speech**: convert text to speech ([SpeechT5](./model_doc/speecht5)) - **Translation**: translates a given sentence from source language to target language. - **DuckDuckGo search***: performs a web search using DuckDuckGo browser. - **Python code interpreter**: runs your the LLM generated Python code in a secure environment. This tool will only be added to `ReactJsonAgent` if you initialize it with `add_base_tools=True`, since code-based agent can already natively execute Python code You can manually use a tool by calling the `load_tool()` function and a task to perform. ```python from transformers import load_tool tool = load_tool("text-to-speech") audio = tool("This is a text to speech tool") ``` ### Create a new tool You can create your own tool for use cases not covered by the default tools from Hugging Face. For example, let's create a tool that returns the most downloaded model for a given task from the Hub. You'll start with the code below. ```python from huggingface_hub import list_models task = "text-classification" model = next(iter(list_models(filter=task, sort="downloads", direction=-1))) print(model.id) ``` This code can quickly be converted into a tool, just by wrapping it in a function and adding the `tool` decorator: ```py from transformers import tool @tool def model_download_tool(task: str) -> str: """ This is a tool that returns the most downloaded model of a given task on the Hugging Face Hub. It returns the name of the checkpoint. Args: task: The task for which """ model = next(iter(list_models(filter="text-classification", sort="downloads", direction=-1))) return model.id ``` The function needs: - A clear name. The name usually describes what the tool does. Since the code returns the model with the most downloads for a task, let's put `model_download_tool`. - Type hints on both inputs and output - A description, that includes an 'Args:' part where each argument is described (without a type indication this time, it will be pulled from the type hint). All these will be automatically baked into the agent's system prompt upon initialization: so strive to make them as clear as possible! > [!TIP] > This definition format is the same as tool schemas used in `apply_chat_template`, the only difference is the added `tool` decorator: read more on our tool use API [here](https://huggingface.co/blog/unified-tool-use#passing-tools-to-a-chat-template). Then you can directly initialize your agent: ```py from transformers import CodeAgent agent = CodeAgent(tools=[model_download_tool], llm_engine=llm_engine) agent.run( "Can you give me the name of the model that has the most downloads in the 'text-to-video' task on the Hugging Face Hub?" ) ``` You get the following: ```text ======== New task ======== Can you give me the name of the model that has the most downloads in the 'text-to-video' task on the Hugging Face Hub? ==== Agent is executing the code below: most_downloaded_model = model_download_tool(task="text-to-video") print(f"The most downloaded model for the 'text-to-video' task is {most_downloaded_model}.") ==== ``` And the output: `"The most downloaded model for the 'text-to-video' task is ByteDance/AnimateDiff-Lightning."` ### Manage your agent's toolbox If you have already initialized an agent, it is inconvenient to reinitialize it from scratch with a tool you want to use. With Transformers, you can manage an agent's toolbox by adding or replacing a tool. Let's add the `model_download_tool` to an existing agent initialized with only the default toolbox. ```python from transformers import CodeAgent agent = CodeAgent(tools=[], llm_engine=llm_engine, add_base_tools=True) agent.toolbox.add_tool(model_download_tool) ``` Now we can leverage both the new tool and the previous text-to-speech tool: ```python agent.run( "Can you read out loud the name of the model that has the most downloads in the 'text-to-video' task on the Hugging Face Hub and return the audio?" ) ``` | **Audio** | |------------------------------------------------------------------------------------------------------------------------------------------------------| |