How to Use Jarvis, Microsoft's One AI Bot to Rule Them All

HuggingGPT Jarvis
(Image credit: Tom's Hardware)

With all the talk about chatbots such as ChatGPT, it's easy to forget that text-based chat is just one of many AI functions. The ideal generative AI would be able to work across different models as needed, interpreting and generating images, audio and video.

Enter Jarvis, a new project from Microsoft that promises one bot to rule them all. Jarvis uses ChatGPT as the controller for a system where it can employ a variety of other models as needed to respond to your prompt. In a paper posted on arXiv, the preprint server hosted by Cornell University, Microsoft researchers (Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu and Yueting Zhuang) explain how this framework works. A user makes a request to the bot, which plans the task, chooses which models it needs, has those models perform the task and then generates and issues a response.

The chart below, provided in the research paper, shows how this process works in the real world. A user asks the bot to create an image where a girl is reading a book and she is positioned the same way that a boy is in a sample image. The bot plans the task, uses a model to interpret the boy's pose in the original image and then deploys another model to draw the output.

Microsoft Jarvis Process

(Image credit: Microsoft Research)
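To make the four stages concrete, here is a toy shell sketch of the flow described above. It is not Jarvis's code: every function is a hypothetical stub that only echoes a label, and the task and model names are invented for illustration.

```shell
#!/bin/sh
# Toy walk-through of the four stages: plan, select models, run, respond.
# All four functions are hypothetical stubs; none of them calls a real model.

plan_task()    { echo "pose-detection pose-to-image"; }  # stage 1: split the request into subtasks
select_model() { echo "model-for-$1"; }                  # stage 2: pick a model for one subtask
run_model()    { echo "result-of-$1"; }                  # stage 3: run the chosen model
summarize()    { echo "response built from: $*"; }       # stage 4: assemble the reply

handle_request() {
  results=""
  for task in $(plan_task "$1"); do
    model=$(select_model "$task")
    results="$results $(run_model "$model")"
  done
  # Word splitting is intended here: each result becomes one argument.
  summarize $results
}

handle_request "draw a girl reading a book in the same pose as the boy"
```

In the real system, the controller (ChatGPT) does the planning and the final summary, while the models in the middle come from Hugging Face.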

Microsoft has a GitHub page where you can download and try out Jarvis on a Linux-powered PC. The company recommends you use Ubuntu (the outdated version 16.04 LTS specifically), but I was able to get its main feature, a terminal-based chatbot, working on Ubuntu 22.04 LTS and on Windows Subsystem for Linux.

However, unless you really like the idea of messing with configuration files, the best way to check out Jarvis is by using HuggingGPT, a web-based chatbot that the Microsoft researchers have set up at Hugging Face, an online AI community which hosts thousands of open-source models.

If you follow the steps below, you'll have a working chatbot that you can show images or other media and ask to output images as well. I should note that, like other bots I've tried, the results were very mixed.

How to Set Up and Try Microsoft Jarvis / HuggingGPT

1. Obtain an OpenAI API key if you don't already have one. You can get it at OpenAI's website by signing in and clicking "Create new secret key." Signing up is free and you will get a free amount of credit, but you will have to pay for more if you use it up. Store the key somewhere you can easily get to it, such as a text file. Once you have copied it and closed the dialog, you can't view it again.

Create new secret key

(Image credit: Tom's Hardware)

2. Sign up for a free account at Hugging Face if you don't already have one and log in to the site. The site is located at huggingface.co, not huggingface.com.

3. Navigate to Settings -> Access Tokens by clicking the links in the left rail.

navigate to hugging face settings

(Image credit: Tom's Hardware)

4. Click New Token.

click New Token

(Image credit: Tom's Hardware)

5. Name the token (anything at all), select "write" as the role and click Generate.

Click generate

(Image credit: Tom's Hardware)

6. Copy the token and keep it somewhere you can easily access it.

Copy API Key

(Image credit: Tom's Hardware)

7. Navigate to the HuggingGPT page.

8. Paste your OpenAI key and Hugging Face token into the appropriate fields. Then hit the submit button next to each.

Enter API Keys for HuggingGPT

(Image credit: Tom's Hardware)

9. Enter your prompt into the query box and click Send.

enter prompt and click send

(Image credit: Future)
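If the page rejects your credentials, a quick format check can rule out a copy-and-paste mix-up between the two fields. The sketch below assumes that OpenAI keys begin with "sk-" and Hugging Face tokens begin with "hf_"; the function names are made up for this example, and a passing check does not prove a key is valid.

```shell
#!/bin/sh
# Rough sanity check for pasted credentials.
# Assumption: OpenAI keys start with "sk-" and Hugging Face tokens with "hf_".

looks_like_openai_key() { case "$1" in sk-?*) return 0 ;; *) return 1 ;; esac; }
looks_like_hf_token()   { case "$1" in hf_?*) return 0 ;; *) return 1 ;; esac; }

check_keys() {
  status=0
  looks_like_openai_key "$1" || { echo "OpenAI key looks wrong" >&2; status=1; }
  looks_like_hf_token "$2"   || { echo "Hugging Face token looks wrong" >&2; status=1; }
  return "$status"
}

# Placeholder values only, not real credentials.
check_keys "sk-example" "hf_example" && echo "both look plausible"
```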

How to Set Up Jarvis / HuggingGPT in Linux

It's much easier to use HuggingGPT on the Hugging Face website. However, if you want to try installing it on your local Ubuntu PC, here's how. You may also be able to modify it to use more models.
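Before you start, it may help to see which of the tools the steps rely on are already installed. This is a small sketch using the shell's built-in command -v; the helper names have and report_tools are invented for this example.

```shell
#!/bin/sh
# Report which of the tools used in the steps below are already on the PATH.

have() { command -v "$1" >/dev/null 2>&1; }

report_tools() {
  for tool in "$@"; do
    if have "$tool"; then
      echo "$tool: found"
    else
      echo "$tool: missing"
    fi
  done
}

report_tools git conda python
```

Anything reported as missing is covered by one of the steps that follow.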

1. Install git if you don't have it already.

sudo apt install git

2. Clone the Jarvis repository from your home directory.

git clone https://github.com/microsoft/JARVIS

3. Navigate to the JARVIS/server/configs folder.

cd JARVIS/server/configs

4. Edit the configuration files and enter your OpenAI API key and Hugging Face token where appropriate. The files are config.azure.yaml, config.default.yaml, config.gradio.yaml and config.lite.yaml. In this how-to, we'll only be using the gradio file, but it makes sense to edit them all. You can edit them using nano (ex: nano config.gradio.yaml). If you don't have these API keys, you can get them for free from OpenAI and Hugging Face.

(Image credit: Future)

5. Install Miniconda if you don't have it installed already. You'll need to download the latest version from the Miniconda site. After downloading the installer, you install it by going to the Downloads folder and entering bash followed by the install script name.

bash Miniconda3-latest-Linux-x86_64.sh

You'll be prompted to agree to a license agreement and confirm the install location. After you have installed Miniconda, close and reopen all terminal windows so that the conda command is in your path. If it still isn't, try rebooting.

6. Navigate back to the JARVIS/server directory.

7. Create and activate a jarvis conda environment.

conda create -n jarvis python=3.8
conda activate jarvis

8. Install some dependencies and models.

conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt
cd models
bash download.sh # required when `inference_mode` is `local` or `hybrid`. 

9. Return to the JARVIS/server folder.

10. Run the command to launch the HuggingGPT local webserver using gradio.

python run_gradio_demo.py --config configs/config.gradio.yaml

You will then be given a local URL you can visit in your web browser. In my case, it was http://127.0.0.1:7860.

gradio link

(Image credit: Future)

11. Visit the URL (ex: http://127.0.0.1:7860) in your browser. If you are using Ubuntu in a VM, use the browser within the VM.

12. Enter your OpenAI API key into the box at the top of the web page.

enter openai api key

(Image credit: Future)

13. Enter your prompt(s) in the prompt box and hit Enter.

prompt box

(Image credit: Future)
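If the server starts but rejects your requests, one thing to rule out is a config file from step 4 that was never edited. The sketch below lists files that still contain placeholder text. It assumes the unedited files mark the key fields with text containing "REPLACE_WITH"; check your own copies and change PLACEHOLDER if they differ.

```shell
#!/bin/sh
# List config files that still seem to contain an unedited placeholder.
# Assumption: placeholders include the text "REPLACE_WITH".
PLACEHOLDER="${PLACEHOLDER:-REPLACE_WITH}"

unedited_configs() {
  for f in "$@"; do
    # Skip files that do not exist; print those that still hold the placeholder.
    if [ -f "$f" ] && grep -qF -- "$PLACEHOLDER" "$f"; then
      echo "$f"
    fi
  done
}

# Run from the JARVIS/server/configs folder.
unedited_configs config.azure.yaml config.default.yaml config.gradio.yaml config.lite.yaml
```

No output means none of the listed files contains the placeholder text.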

Using the gradio server is just one possible way to interact with Jarvis under Linux. The Jarvis GitHub page has more choices. These include using the models server or starting a command-line-based chat.

I couldn't get most of these methods working (the command-line chat worked OK but wasn't as nice an interface as the web one). Also, you may be able to install more models and get text-to-video generation going (which I could not).

What to Try with Jarvis / HuggingGPT

The bot can answer standard text questions, along with queries asking about images, audio and video. It can also potentially generate images, sound or video for you. I say potentially because, if you use the web version, it's limited by whatever free models it can access from Hugging Face. On the Linux version, you may be able to add some additional models.

There are some sample queries listed below the prompt box that you can click and try. These include feeding it three example images and having it count how many zebras are in them, asking it to tell a joke and show a cat picture or asking it to generate one image that looks like another one.

zebras in the picture

(Image credit: Future)

Since it's web-based, the way to feed it images is to send it the URLs of pictures that are online. However, if you are able to use the Linux version, you can store images locally in the JARVIS/server/public folder and refer to them by relative URLs (ex: /myimage.jpg would be in the public folder and /examples/myimage.jpg would be in the examples subfolder of public).
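The mapping from a local file to the relative URL you type in a prompt can be sketched as a small shell function. This assumes you cloned the repository into your home directory, so the public folder is $HOME/JARVIS/server/public; relative_url is a made-up helper name, not part of Jarvis.

```shell
#!/bin/sh
# Map a file under the public folder to the relative URL used in a prompt.
# Assumption: the repository was cloned into the home directory.
PUBLIC_DIR="${PUBLIC_DIR:-$HOME/JARVIS/server/public}"

relative_url() {
  case "$1" in
    "$PUBLIC_DIR"/*)
      rel=${1#"$PUBLIC_DIR"/}   # strip the public folder prefix
      echo "/$rel"
      ;;
    *)
      echo "not inside $PUBLIC_DIR: $1" >&2
      return 1
      ;;
  esac
}

relative_url "$PUBLIC_DIR/myimage.jpg"
relative_url "$PUBLIC_DIR/examples/myimage.jpg"
```

The two calls print /myimage.jpg and /examples/myimage.jpg, matching the examples in the text.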

Most original queries I tried did not turn out particularly well. Image recognition was particularly poor. When I fed it images of M.2 SSDs and asked where I could buy one, it said that it had identified the SSDs as a suitcase and then told me to find "a store."

asking about an SSD

(Image credit: Future)

Similarly, when I fed it a screenshot from Minecraft and asked it where I could buy it, it falsely claimed that it saw a kite flying through the air. It thought an RTX 4070 was a black and white photo of a computer. And when I asked where I could buy one, it said "you can purchase one of these items from our online store or from a variety of retailers near you," but there was no actual link to any real online store.

RTX 4070

(Image credit: Future)

It wasn't very good at generating images on demand. For example, I asked it to draw Abraham Lincoln driving a convertible and it just gave me a straightforward bust of the former president.

drawing lincoln

(Image credit: Future)

In short, apart from the specific examples Microsoft suggests, most queries did not turn out particularly well. But as with other AI frameworks such as Auto-GPT and BabyAGI, the problem is in the models you use and, as the models improve, so will your output. If you want to try autonomous agents, check out our tutorials on how to use Auto-GPT and how to use BabyAGI.

Avram Piltch
Avram Piltch is Tom's Hardware's editor-in-chief. When he's not playing with the latest gadgets at work or putting on VR helmets at trade shows, you'll find him rooting his phone, taking apart his PC or coding plugins. With his technical knowledge and passion for testing, Avram developed many real-world benchmarks, including our laptop battery test.
  • bit_user
    I wonder if they paid any license fee to Marvel/Disney, because using the name Jarvis for an AI assistant is so clearly a reference to Iron Man's AI assistant.

    I think there was a lawsuit against someone using the name Jeeves, a few years ago, and they actually won.

    The company recommends you use Ubuntu (the outdated version 16 LTS specifically)
    That's pretty nuts, because it stopped getting security updates like 2 years ago! The LTS versions are released in April of every even year, and get support for only 5 years. So, 18.04 LTS is just now reaching the end of its support window.

    Hugging Face, an online AI community which hosts thousands of open-source models.
    That name always conjures in my mind the scene from Alien, where it latches onto a guy's face and deposits an egg down his throat.
  • punkncat
    One of my first thoughts was to the rights for the Jarvis moniker.