Update README.md
ZeyuMi authored Dec 16, 2023
1 parent b4f4f64 commit cff625c
Showing 1 changed file with 6 additions and 6 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -25,10 +25,10 @@ only 18\% lower than that achieved by a top-tier server-grade A100 GPU.
This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy.

## Features
- PowerInfer is a fast and easy-to-use inference engine for deploying LLM locally. Interestingly, we observe that in ReLU LLM, every neuron is an expert! And a small subset of neurons consistently contributes to the output.
+ PowerInfer is a high-speed and easy-to-use inference engine for deploying LLMs locally. Interestingly, we observe that in ReLU LLMs, every neuron is an expert, and a small subset of neurons consistently contributes to the output.
PowerInfer is fast with:

- - Exploiting the high locality in LLM infernece
+ - Exploiting the high locality in LLM inference
- Neuron-aware hybrid CPU/GPU sparse operator
- Neuron granularity offloading

@@ -79,7 +79,7 @@ cmake --build build --config Release
```

## Model Weights
- As for now, we have't released predictor training code, we suggest you can download the sparse-model from huggingface in the following link.
+ As we have not yet released the predictor training code, we suggest you download the sparse models from Hugging Face via the links below.
| Base Model | GGUF Format Link | Original Model |
|------------|------------------|----------------|
| LLaMA(ReLU)-2-7B | [PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF](https://huggingface.co/PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF) | [SparseLLM/ReluLLaMA-7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) |
@@ -96,11 +96,11 @@ As for now, we have't released predictor training code, we suggest you can downl
./build/bin/main -m /PATH/TO/MODEL -n $(output_token_count) -t $(thread_num) -p $(prompt) --vram-budget $(GPU_VRAM_OFFLOADING)
```
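
For instance, a concrete invocation might look like the sketch below. The model path, prompt, and numeric values are illustrative placeholders rather than shipped defaults, and we assume the VRAM budget is given in GiB:
```bash
# Illustrative values only; substitute your own model path, thread count, and VRAM budget.
./build/bin/main \
  -m ./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf \
  -n 128 \
  -t 8 \
  -p "Once upon a time" \
  --vram-budget 8
```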

- As for now, it requires a offline-generated "GPU index" file to split FFNs on GPU. If you want to try it, please use the following instruction to generate the GPU index file:
+ For now, it requires an offline-generated "GPU index" file to decide how the FFNs are split between CPU and GPU. If you want to try it, please use the following command to generate the GPU index file:
```bash
python scripts/export-gpu-split.py $(activation_count_path) $(output_idx_path) solver
```
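As a sketch, assuming the activation statistics live at a hypothetical `./activation` path and the index is written next to the model (both paths are made up for illustration):
```bash
# Hypothetical input/output paths; the trailing "solver" argument is copied from the template above.
python scripts/export-gpu-split.py ./activation ./llama-7b-relu.gpu-index.bin solver
```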
- Then, you can use the following instruction to run PowerInfer with GPU index:
+ Then, you can use the following command to run PowerInfer with the GPU index:
```bash
./build/bin/main -m /PATH/TO/MODEL -n $(output_token_count) -t $(thread_num) -p $(prompt) --gpu-index $(split_path)
```
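Continuing with the illustrative paths from the previous sketches, a full run with the generated index might be:
```bash
# Same hypothetical model and index paths as in the earlier sketches.
./build/bin/main \
  -m ./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf \
  -n 128 -t 8 -p "Once upon a time" \
  --gpu-index ./llama-7b-relu.gpu-index.bin
```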
@@ -111,7 +111,7 @@ Then, you can use the following instruction to run PowerInfer with GPU index:

![github-eval-2080ti-q4](https://github.com/SJTU-IPADS/PowerInfer/assets/34213478/0fc1bfc4-aafc-4e82-a865-bec0143aff1a)

- PowerInfer achieves up to 11x and 8x speedup for FP16 and INT4 model!
+ PowerInfer achieves up to 11.69x and 8.00x speedup for FP16 and INT4 models!

## TODOs
We will release the code and data in the following order. Please stay tuned!
