Can adding more GPUs to your system supercharge AI model performance? That’s the question we set out to answer in this DeepSeek-R1 benchmark test using AMD GPUs. With a variety of models and tasks, we push the limits of VRAM, processing power, and efficiency to see if doubling the hardware makes a real difference or if there’s more to the story.
For the tests in this article we will use a machine with a Ryzen 5 3600 CPU, 40 GB of RAM, and two AMD Radeon RX 6800 GPUs with 16 GB of VRAM each. With this amount of VRAM (32 GB) we will be able to run every model up to 32b (which needs 24 GB of VRAM) on the GPUs. Thanks to the amount of system RAM, we should also be able to run the 70b model (40 GB of VRAM), but only on the CPU.
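If you are wondering how we map model sizes to VRAM, a back-of-the-envelope estimate is enough: a 4-bit quantized model needs very roughly 0.6 to 0.7 GB per billion parameters, plus a couple of GB of overhead for context. A minimal sketch, where the constants are our own rough assumptions rather than official Ollama figures:

```python
# Back-of-the-envelope VRAM estimate for 4-bit quantized models.
# The bytes-per-parameter and overhead values are our own rough
# assumptions; real usage varies with quantization and context length.
def estimate_vram_gb(params_billions: float,
                     bytes_per_param: float = 0.625,
                     overhead_gb: float = 2.0) -> float:
    return params_billions * bytes_per_param + overhead_gb

for size in (1.5, 7, 8, 14, 32, 70):
    print(f"deepseek-r1:{size:g}b -> ~{estimate_vram_gb(size):.0f} GB")
```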
If you want to import our test chats to check the results for yourself, you will need OpenWebUI running and, if you want to continue a chat, you will also need Ollama. You can learn how to set up Ollama and OpenWebUI with one of these articles below:
Once you have OpenWebUI running, follow the steps below to import a chat:
If you want to calculate the average time and tokens per second in a chat, you can run the Python script below and give it the path to the exported chat file:
Python Script
```python
import json
import os


def main():
    while True:
        try:
            # Get the input file path ("quit" exits)
            path = input("Paste file path: ").strip()
            if path.lower() == "quit":
                break
            # Remove surrounding quotes if necessary
            if path.startswith('"') and path.endswith('"'):
                path = path[1:-1]
            # Check that the file exists and has a .json extension
            if not os.path.exists(path) or "json" not in os.path.splitext(path)[1]:
                print("Invalid file.")
                continue
            # Read and parse the JSON file
            with open(path, "r", encoding="utf-8") as f:
                data = json.load(f)
            # Expect the JSON to be a list with a single element
            if not isinstance(data, list) or len(data) != 1:
                print("Invalid file.")
                continue
            obj = data[0]
            # Navigate to chat -> history -> messages
            messages = obj.get("chat", {}).get("history", {}).get("messages")
            if messages is None:
                print("Invalid file.")
                continue
            # Extract usage objects from the messages
            usages = []
            for message in messages.values():
                usage = message.get("usage")
                if usage is not None:
                    usages.append(usage)
            if not usages:
                print("No usage data found.")
                continue
            # Sum durations and token/s rates across all usage entries
            total_time = 0.0
            total_tokens = 0.0
            valid_count = 0
            for usage in usages:
                # Get the values; if either is missing, skip this usage
                total_duration = usage.get("total_duration")
                resp_tokens = usage.get("response_token/s")
                if total_duration is None or resp_tokens is None:
                    continue
                total_time += total_duration
                total_tokens += resp_tokens
                valid_count += 1
            if valid_count == 0:
                print("No valid usage data found.")
                continue
            # Average time (converted from nanoseconds to seconds) and tokens per second
            avg_time_secs = (total_time / valid_count) / 1_000_000_000.0
            avg_tokens = total_tokens / valid_count
            # Print the results
            print(f"Tokens/Sec: {avg_tokens:.3f}")
            print(f"Time /Secs: {avg_time_secs:.3f}")
        except Exception as ex:
            print("Error:", ex)


if __name__ == "__main__":
    main()
```
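Running the script and pasting the path to an exported chat looks like this (the path and the numbers here are purely illustrative):

```
Paste file path: C:\Users\me\Downloads\chat-export.json
Tokens/Sec: 30.420
Time /Secs: 14.037
```

Type quit to exit the script.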
We will now run some tests, very similar to the ones we did in a previous article with NVIDIA GPUs. At the end of this article, we will also compare each model's speed on 2 GPUs against a single one.
Sentence parsing and counting
This prompt tests how well the model can remember a phrase it generated, whether it can count words, and whether it can split a word into letters.
Prompt
Write me one random sentence about a dog then tell me the number of words in the sentence you wrote and, finally, the fourth letter in the third word and whether it is a vowel or a consonant.
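To grade the replies without recounting by hand, a small helper of our own (not part of the benchmark itself) computes the ground truth for any sentence a model produces:

```python
def check_sentence(sentence: str) -> None:
    # Count the words, then inspect the fourth letter of the third word.
    words = sentence.split()
    third = words[2].strip('.,!?"“”')  # strip punctuation and quotes
    fourth = third[3]
    kind = "vowel" if fourth.lower() in "aeiou" else "consonant"
    print(f"{len(words)} words; 4th letter of 3rd word '{third}' is '{fourth}' ({kind})")

check_sentence("Dogs are loyal companions")
# -> 4 words; 4th letter of 3rd word 'loyal' is 'a' (vowel)
```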
deepseek-r1:1.5b
deepseek-r1:7b
deepseek-r1:8b
The 8b model gave us the correct reply 7 times, more than double the previous model. It averaged about 30.42 Tokens per second and gave us a reply in about 14 seconds, 5 seconds less than the previous model, which means that 7b spent more time thinking than 8b.
Top Reply
The sentence “Dogs are loyal companions” has 4 words. The third word is “loyal,” where the fourth letter is ‘a,’ which is a vowel.
deepseek-r1:14b
The 14b model gave us the correct reply only 4 times, sadly fewer than the previous model. It averaged about 19.35 Tokens per second and gave us a reply in about 50 seconds, 35 seconds more than the previous model.
Top Reply
“My golden retriever loves chasing tennis balls in the park.”
This sentence has 10 words. The third word is “retriever,” and the fourth letter in that word is “r”, which is a consonant.
deepseek-r1:32b
The 32b model gave us the correct reply 8 times. It averaged about 9.24 Tokens per second and gave us a reply in about 1 minute and 25 seconds. This was the model that gave us the most correct replies but also the slowest.
Top Reply
The golden retriever fetched the ball with ease.
- Number of words: 8
- Fourth letter in the third word: ‘R’ (consonant)
Deductive arrays and offsets
This prompt tests whether the model can work out the relationship between letters and numbers, while encouraging DeepSeek to generalize by requiring multiple answers that use the same method of resolution.
Prompt
If A is equal to the number 0, what is the number of F, J, W and Z.
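The expected answers are easy to verify: with A mapped to 0, each letter's number is its alphabet position minus one, which one small loop of our own confirms:

```python
# With A = 0, a letter's number is simply its offset from 'A'.
for letter in "FJWZ":
    print(letter, "=", ord(letter) - ord("A"))
# F = 5, J = 9, W = 22, Z = 25
```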
deepseek-r1:1.5b
The 1.5b model only got the correct answer once. It averaged about 50.94 Tokens per second and gave us a reply in about 8 seconds.
Top Reply
Let’s determine the value of each letter:
- F: The 6th letter of the alphabet (since A is 1). F=5
- J: The 10th letter. J=9
- W: The 23rd letter. W=22
- Z: The 26th letter. Z=25
deepseek-r1:7b
The 7b model gave us the correct reply 9 times, almost a perfect score. It averaged about 43.9 Tokens per second and gave us a reply in about 33 seconds.
Top Reply
The numbers corresponding to the letters are based on their positions in the alphabet, with A being 0 and each subsequent letter increasing by 1. Therefore:
- F corresponds to 5
- J corresponds to 9
- W corresponds to 22
- Z corresponds to 25
deepseek-r1:8b
The 8b model could not come up with the correct reply. It averaged about 27.75 Tokens per second and gave us a reply in about 19 seconds.
Top Reply
To determine how many times each letter F, J, W, and Z appear in the word “Zero”, let’s analyze the word step by step.
Step 1: Understand the Word
- The word “Zero” is spelled as follows: Z E R O
Step 2: Count Each Letter
- F: Does “Zero” contain the letter ‘F’? No
- J: Does “Zero” contain the letter ‘J’? No
- W: Does “Zero” contain the letter ‘W’? No
- Z: Does “Zero” contain the letter ‘Z’? Yes, at the beginning.
Final Answer
- F=0
- J=0
- W=0
- Z=1
deepseek-r1:14b
The 14b model gave us the correct reply only 4 times, sadly fewer than the 7b model but at least more than 8b. It averaged about 22.13 Tokens per second and gave us a reply in about 37 seconds, only 6 seconds more than 7b.
Top Reply
The number values corresponding to F, J, W, and Z when A equals 0 (using zero-based indexing) are as follows:
- F: 5
- J: 9
- W: 22
- Z: 25
Each letter’s value is determined by its position in the alphabet minus one.
deepseek-r1:32b
The 32b model gave us the correct reply 9 times, like 7b. It averaged about 8.87 Tokens per second and gave us a reply in about 2 minutes and 22 seconds.
Top Reply
The numbers corresponding to each letter when starting from A=0 are as follows: F: 5, J: 9, W: 22, Z: 25
Answer: F is 5, J is 9, W is 22, and Z is 25.
Strawberries and Peppermints
This prompt tests multiple aspects of an LLM's reasoning and text processing abilities, such as basic counting, character recognition, memory, context retention, and handling multiple similar tasks. Sadly, it seems to have been too complex, as only 14b gave us the correct reply, and only once.
Prompt
In the word strawberry, how many of its letters are vowels and how many of them are Rs? Also, in the word peppermint how many letters are vowels and how many Ps are there?
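Again, the ground truth is easy to compute ourselves; a quick sketch:

```python
def count_letters(word: str, target: str) -> None:
    # Count vowels (a, e, i, o, u) and occurrences of a target letter.
    vowels = sum(ch in "aeiou" for ch in word.lower())
    hits = word.lower().count(target.lower())
    print(f"{word}: {vowels} vowels, {hits} occurrences of '{target}'")

count_letters("strawberry", "r")  # 2 vowels (3 counting the y), 3 Rs
count_letters("peppermint", "p")  # 3 vowels, 3 Ps
```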
deepseek-r1:1.5b
deepseek-r1:7b
deepseek-r1:8b
deepseek-r1:14b
The 14b model was the only one to give us a correct reply. It averaged about 21 Tokens per second and gave us a reply in about 1 minute and 29 seconds.
Top Reply
In the word “strawberry”:
- There are 2 vowels: A and E.
- There are 3 Rs in the letters.
In the word “peppermint”:
- There are 3 vowels: E, E, and I.
- There are 3 Ps in the letters.
Positional and time awareness
This prompt tests time-based reasoning, interval mapping, reading comprehension, and logical deduction. A key aspect of the expected answer is that it must include not only what the cat is doing, but also where it is.
Prompt
Every day from 2PM to 4PM the household cat, Tobias, is in the window. From 2 until 3, Tobias is looking at birds. For the next half hour, Tobias is sleeping. On the final half hour, Tobias is cleaning himself. The time is 3:14PM, where and what is Tobias doing.
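For reference, the expected answer is mechanical once the schedule is written down; here is a sketch of the lookup (our own, purely to show the ground truth, not something the models see):

```python
from datetime import time

# Tobias's schedule as (start, end, activity); he is in the window throughout.
SCHEDULE = [
    (time(14, 0), time(15, 0), "looking at birds"),
    (time(15, 0), time(15, 30), "sleeping"),
    (time(15, 30), time(16, 0), "cleaning himself"),
]

def what_is_tobias_doing(now: time) -> str:
    for start, end, activity in SCHEDULE:
        if start <= now < end:
            return f"in the window, {activity}"
    return "not in the window"

print(what_is_tobias_doing(time(15, 14)))  # -> in the window, sleeping
```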
deepseek-r1:1.5b
deepseek-r1:7b
deepseek-r1:8b
deepseek-r1:14b
SVG Generation
This prompt tests code generation, the interpretation of open-ended requests (a "smile" is a broad concept), knowledge of SVG, and spatial and geometric reasoning; a reference sketch of what a valid answer could look like follows the model list below.
Prompt
deepseek-r1:1.5b
deepseek-r1:7b
deepseek-r1:8b
deepseek-r1:14b
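For reference, here is a minimal smiley face in SVG, hand-written by us rather than taken from any model's reply:

```python
# A minimal hand-written smiley SVG (our own reference, not a model output).
svg = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <circle cx="50" cy="50" r="45" fill="gold" stroke="black"/>
  <circle cx="35" cy="40" r="5"/>
  <circle cx="65" cy="40" r="5"/>
  <path d="M 30 62 Q 50 80 70 62" fill="none" stroke="black" stroke-width="4"/>
</svg>"""

with open("smile.svg", "w", encoding="utf-8") as f:
    f.write(svg)
```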
Now we take the averages across every prompt (5 prompts, run 10 times each, on 1 GPU and on 2 GPUs, for a total of 100 tests per model) and write them in the table below:
Model | 1 GPU Token/s | 1 GPU Time/s | 2 GPUs Token/s | 2 GPUs Time/s |
---|---|---|---|---|
1.5b | 87.91 | 10.11 | 70.08 | 10.36 |
7b | 37.75 | 23.25 | 40.12 | 25.12 |
8b | 28.84 | 43.30 | 32.59 | 22.30 |
14b | 19.30 | 46.36 | 19.23 | 53.94 |
32b | N/A | N/A | 8.55 | 102.60 |

(The 32b model does not fit in a single GPU's 16 GB of VRAM, so it only has results for 2 GPUs.)
Keeping in mind the random nature of LLMs (regenerating a reply can give vastly different results), the time and token differences we see do not point to a clear improvement.
What I conclude from this data is that the number of GPUs you have does not change the speed at which a model replies, whether you have 1 or 20, as long as you have enough VRAM: every model will run at the same speed.
The only improvement you get from adding GPUs to your system is the ability to run larger models. In my case, with only one GPU (16 GB of VRAM) we could only run 14b or below, while with 2 GPUs (32 GB of VRAM) we could also run 32b (which uses 24 GB of VRAM). Sadly, we could not run the 70b model, as it requires around 40 GB of VRAM.
And that’s all, thanks for reading and stay tuned for more tech insights and tutorials. Until next time, keep exploring the world of tech!