Stockfish-AI Onboarding Walkthrough
This page serves as a technical onboarding for members of the HUMIC community. Beyond some familiarity with Python, no prior knowledge of machine learning or AI is expected.
The following learning objectives will be covered:
- Version control and collaboration with GitHub
- Engineering workflows with Hugging Face models
- Storing and retrieving data from vector databases
- Quick iteration on the frontend with Gradio
🚨 Disclaimer
This tutorial is intended to be run on a Unix-based machine such as a Mac or Linux machine. Students on other machines can still follow along with some adjustments.
Setup
Getting started with Github
GitHub is a popular hosting platform for Git, the version control system that lets you track changes to your code.
Familiarity with Git commands will become increasingly important as you work on projects with larger teams.
Conceptually every repository will have a remote version (which you see on the website) and a local version (which you see in your IDE).
Every developer will then push their changes from local -> remote and pull a collaborator's changes from remote -> local.
After signing up, you can fork the starter template available at https://github.com/HUMIC-CLUB/Stockfish-template .
If you kept the default name it should be at https://github.com/<YourUsername>/Stockfish-template .
To create a local version from the remote version, you can clone the repository to your local machine using the following command:
shell
$ git clone https://github.com/<YourUsername>/Stockfish-template.git
Python virtual environments
When installing Python libraries, it is recommended to use a virtual environment to avoid conflicts with other projects.
Pip is the most popular package manager for Python, and it is the one we will be using.
To create a virtual environment, in the folder of the project (remember to use cd to navigate to the folder), run the following command:
shell
$ python -m venv .myenv
This will create a virtual environment in the folder .myenv . Feel free to use a different name.
To activate the virtual environment, run the following command:
shell
$ source .myenv/bin/activate
You can deactivate the virtual environment by running deactivate .
There will be quite a few dependencies to install. We have added all the necessary dependencies in the requirements.txt file.
To install them, run the following command while the virtual environment is activated:
shell
$ pip install -r requirements.txt
💡 Digging through pip dependencies
You can see which packages have been installed by running the following command:
shell
$ pip list
Sometimes this list gets pretty long, so to check whether a specific package like "transformers" has been installed, you can run the following command:
shell
$ pip list | grep transformers
The packages will be in the .myenv folder. We will see in the next part that this is only for your copy of the repo so it should not be pushed to the remote repository.
Image Embedding Model
What are vector embeddings?
Vector embeddings are a way to represent rich data like images, text, audio, etc. in a vector space.
Every possible object is represented as an array of numbers with fixed length, known as the embedding dimension .
For further reading, you can check out this article or look at our fellowship resources here .
We want a model to convert an image of a fish into a vector. Objects that are close to each other in the vector space are semantically similar (e.g. images of the same fish species should be closer to each other geometrically).
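To make this concrete, here is a toy sketch with made-up 3-dimensional vectors (real embedding models output hundreds of dimensions) showing how geometric distance captures similarity:

```python
import math

# Toy "embeddings" with a fixed dimension of 3. Real models use far
# more dimensions (e.g. 512 for the CLIP model later in this tutorial),
# and all values here are made up purely for illustration.
embeddings = {
    "clownfish": [0.9, 0.1, 0.2],
    "goldfish":  [0.8, 0.2, 0.3],
    "pelican":   [0.1, 0.9, 0.7],
}

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The two fish vectors are geometrically close; the bird is far from both.
d_fish = euclidean_distance(embeddings["clownfish"], embeddings["goldfish"])
d_bird = euclidean_distance(embeddings["clownfish"], embeddings["pelican"])
print(f"fish-to-fish: {d_fish:.3f}, fish-to-bird: {d_bird:.3f}")
```

A real embedding model does exactly this, just with learned vectors instead of hand-picked ones.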
Hugging Face Models
Hugging Face is a popular platform for downloading open-weight machine learning models.
They also provide Python libraries to download and use the models such as Sentence Transformers
which we will use for our Image Embedding Model.
python
class ImageEmbedder:
"""Handles image embedding generation using CLIP model."""
def __init__(self, model_name: str, dimension: int):
"""
Initialize the CLIP model for image embeddings.
Args:
model_name: Name of the sentence-transformers CLIP model
dimension: Dimension of the embedding
"""
self.model = SentenceTransformer(model_name)
self.dimension = dimension
In src/embeddings.py , we will implement the ImageEmbedder class.
Note the use of Python type hints in the class constructor, which help with type checking and documentation.
CLIP Model
There are many pre-trained models available on Hugging Face. We will be using the sentence-transformers/clip-ViT-B-32 model which is a Vision Transformer
that was trained jointly on text and images, with a paired dataset of captioned images (read more here ).
This model gives embeddings of dimension 512 .
This Python file will most likely be imported to another file, but we can still test it in isolation. A quick way to add a small unit test is to add a main block that is run when the file is executed directly.
If we run the file directly, we should see the model weights being downloaded locally with the safetensor format.
python
if __name__ == "__main__":
# Initialize embedder
embedder = ImageEmbedder(model_name='clip-ViT-B-32', dimension=512)
shell
$ python src/embeddings.py
ℹ️ Caching weights
You may ask if the model will be downloaded every time the file is executed.
The answer is only the first time for that model. Subsequent runs will use the cached weights.
Check out "~/.cache/huggingface/hub" for all your models and their weights.
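A small helper can peek into that cache folder (the function name here is our own, and HF_HOME is the environment variable that relocates the cache if set):

```python
from pathlib import Path

def list_cached_models(cache_dir: Path) -> list:
    """Return the names of model folders in a local Hugging Face cache.

    Returns an empty list when nothing has been downloaded yet.
    """
    if not cache_dir.exists():
        return []
    return sorted(p.name for p in cache_dir.iterdir() if p.is_dir())

# Default cache location used by the Hugging Face hub libraries
# (it can be moved with the HF_HOME environment variable).
hub_cache = Path.home() / ".cache" / "huggingface" / "hub"
print(list_cached_models(hub_cache))
```

After running src/embeddings.py once, you should see a folder for the CLIP model appear in this listing.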
Generating Embeddings
We can now implement the get_image_embedding function.
For convenience when processing images we will use the Pillow library
Notice the return type is a list of floats given by List[float] .
It requires a local image path as input. We already have fishes stored in the static directory such as static/clownfish.jpeg . We can now test it out at the bottom of our python file and we should see an array with 512 floats:
python
from typing import List
from PIL import Image
...
def get_image_embedding(self, image_path: str) -> List[float]:
# Load image
image = Image.open(image_path)
# Generate embedding (returns numpy array)
embedding = self.model.encode(image, convert_to_numpy=True)
# Convert to list of floats
return embedding.tolist()
python
...
if __name__ == "__main__":
# Initialize embedder
embedder = ImageEmbedder(model_name='clip-ViT-B-32', dimension=512)
image_path = "static/clownfish.jpeg"
embedding = embedder.get_image_embedding(image_path)
print(f"Embedding vector (dimension: {len(embedding)}):")
print(embedding)
Pushing to Github
Before pushing it is useful to check the status of your local version such as which files can be committed with git status .
Note that our Python virtual environment would be committed which is not ideal.
To exclude the Python virtual environment from being committed, we can add it to the .gitignore file.
This file will be used again when we want to prevent pushing secrets such as api keys.
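For example, a minimal .gitignore for this project might contain (the folder name matches the virtual environment created earlier):

```gitignore
# Python virtual environment
.myenv/
# Environment variables and secrets (used later for API keys)
.env
# Python bytecode caches
__pycache__/
```

Running git status again afterwards should no longer list these paths.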
Now our code is ready to be pushed. First, we need to add the changes to the staging area using the following command:
Then, we can commit the changes. The -m flag allows you to tag a descriptive message to the commit.
Finally, we can push the changes to the remote repository, which should show up on the Github website.
If you get an error, make sure you are on your own forked repository and not the original template repository.
shell
$ git add .
shell
$ git commit -m "Feat: implemented core image embedder functionality"
shell
$ git push
ℹ️ Git Branches
We will be mostly using the main branch for our project.
When many people are touching the same code, you will often push to a separate branch instead of the main branch so that merge conflicts are less disruptive.
Read here for more information on Git Branches .
Part B Exercises
Batch Embeddings
Implement a new class method get_batch_embeddings which takes in a list of image paths and encodes all of them at once.
Return the list of embedding vectors. Try to only call the "encode" function once.
Scrappy Testing
Test the method(s) from the previous exercise by adding two more images of fish from the internet to the static folder
and passing in a list of image file paths to the method at the bottom of the Python file.
Push your changes to the Github remote repository.
Vector Databases with Pinecone
Pinecone setup
A vector embedding is not that interesting by itself. But once combined with many data points, the distance between every two entries in the dataset becomes meaningful.
A vector database is a way to store and query these embeddings. Many general database services support vector search such as Supabase and MongoDB .
We will be taking a look at Pinecone which is more specialized in this domain. They have a generous starter tier just by signing up.
After signing up, we can create a new index, called fish-index . An index is a collection of vectors that can be queried. You can use any configuration, so long as the dimension matches our CLIP model's output dimension of 512 .
Access to the database is restricted by a secret credential known as the API Key . For projects where the source code will be published, such as on Github,
it is important to make sure the API key is ignored by the version control system. The .env paradigm is almost universal now for storing environment variables.
Make sure to add the .env file to the .gitignore file before pushing to Github. After copying an api key from Pinecone, we can create a new .env file in the root directory and add the following line:
.env
PINECONE_API_KEY=your-api-key
Connecting to the database
We can see from the template code in src/database.py that we will be using the Pinecone library to interact with the database.
To gain access to the specific slice of data we need, we have to specify the index name fish-index and then a namespace .
To avoid hardcoding the API key, we load the env file with the python-dotenv library.
The rest of the constructor is just extracting the reference to the index from pc.Index(index_name)
python
from pinecone import Pinecone
...
class FishVectorDB:
"""Handles Pinecone vector database operations for fish embeddings."""
def __init__(
self,
index_name: str,
namespace: str
):
"""
Initialize Pinecone client and connect to index.
Args:
index_name: Name of the Pinecone index
namespace: Namespace for data isolation (mandatory)
"""
# Initialize Pinecone client
pc = Pinecone(api_key="FILL IN YOUR API KEY HERE")
self.index_name = index_name
self.namespace = namespace
python
import os
from dotenv import load_dotenv
...
class FishVectorDB:
"""Handles Pinecone vector database operations for fish embeddings."""
def __init__(
self,
index_name: str,
namespace: str
):
# Load environment variables from .env file
load_dotenv()
# Initialize Pinecone client
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
...
Seeding the database
The upsert operation allows one to either insert a new vector into the database or update an existing one, depending on whether the id already exists. Every vector that represents our image will have accompanying metadata as seen in static/fish_species.csv .
csv
id,species,description,region,conservation_status
clownfish,Clownfish,"Bright orange fish with white stripes, commonly found in coral reefs.",Indo-Pacific,LC
goldfish,Goldfish,"Small orange or gold-colored freshwater fish, popular in aquariums.",Worldwide (domestic),LC
...
🚨 Dataset disclaimer
The fish dataset is not an official one and was scraped from the internet from various sources. Apologies for any inaccuracies.
The csv parsing logic is already implemented in src/seed.py .
We can then use the upsert_batch method from our FishVectorDB class to upsert the vectors to the database after the csv file has been parsed.
Now all that's left is to run the seeding function with python src/seed.py . Calling the function in the main block should be enough.
By going to the Pinecone console in the browser, we can see the metadata has been upserted including the vector representation of each image in the static directory.
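For reference, the heart of an upsert_batch method is converting records into Pinecone's vector format: a list of dicts with id, values, and metadata keys. Here is a minimal sketch (the helper name and record shape are illustrative, not necessarily the template's exact implementation):

```python
from typing import Any, Dict, List

def to_pinecone_vectors(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert parsed fish records into Pinecone's upsert format."""
    vectors = []
    for rec in records:
        vectors.append({
            "id": rec["id"],
            "values": rec["embedding"],  # the embedding as a list of floats
            "metadata": {k: v for k, v in rec.items()
                         if k not in ("id", "embedding")},
        })
    return vectors

# A record shaped like a parsed CSV row plus its embedding
# (a dummy 3-d vector here instead of the real 512-d one):
records = [{"id": "clownfish", "embedding": [0.1, 0.2, 0.3],
            "species": "Clownfish", "region": "Indo-Pacific"}]
vectors = to_pinecone_vectors(records)
# Inside FishVectorDB.upsert_batch one would then call something like:
#   self.index.upsert(vectors=vectors, namespace=self.namespace)
print(vectors[0]["id"], sorted(vectors[0]["metadata"]))
```

Keeping the metadata alongside the vector is what lets us return human-readable results later instead of just ids and scores.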
python
from embeddings import ImageEmbedder
from database import FishVectorDB
...
def seed_from_csv(
csv_path: str,
embedder: ImageEmbedder,
db: FishVectorDB
) -> None:
"""
Seed Pinecone database from CSV file.
Args:
csv_path: Path to CSV file with columns: id, species, description, region, conservation_status, image_path
embedder: ImageEmbedder instance for generating embeddings
db: FishVectorDB instance
"""
...
python
...
# Upsert in batch
if fish_records:
print(f"Upserting {len(fish_records)} fish records...")
db.upsert_batch(fish_records)
print("Seeding completed!")
else:
print("No records to upsert.")
...
python
...
if __name__ == "__main__":
embedder = ImageEmbedder(model_name='clip-ViT-B-32', dimension=512)
db = FishVectorDB(index_name="fish-index", namespace="fish_species")
print("=== Seeding database from CSV ===")
seed_from_csv("static/fish_species.csv", embedder, db)
💡 Resetting the index
If the seed function only upserted partially, or it was accidentally executed twice, you can reset the database with:
python
db.clear_index()
Part C Exercises
Batched upserts
For larger datasets, we may quickly hit rate limits from Pinecone in our upsert_batch method.
Run a loop to upsert the vectors in batches of maximum size batch_size .
Database stats
Implement the get_stats method in the FishVectorDB class which returns useful information about the database such as the number of vectors and the dimension of each one.
The describe_index_stats method may be useful here.
Querying a Vector DB
Cosine similarity search
This section will be a bit shorter but more technical.
We need a way to find the closest fish species vectors to an embedding vector of an unseen image. Cosine similarity is a popular
way of measuring this which gives a score for each vector in the database between -1 and 1 with a higher score indicating more similarity.
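Conceptually, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python version, for intuition only (Pinecone computes this server-side):

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (|a| * |b|): 1 = same direction, 0 = orthogonal, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # identical direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal -> 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite -> -1.0
```

Note that only the direction of the vectors matters, not their magnitude, which is why it works well for comparing embeddings.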
We will add a new function search_similar back in our FishVectorDB class to return the top k results.
python
...
def search_similar(
self,
query_embedding: List[float],
top_k: int
) -> List[Dict[str, Any]]:
"""
Search for similar fish based on embedding similarity.
Args:
query_embedding: 512-d embedding vector to search for
top_k: Number of results to return
Returns:
List of search results with id, score, and metadata
"""
# Query using vector (the similarity metric, e.g. cosine,
# is fixed when the index is created, not per query)
results = self.index.query(
namespace=self.namespace,
vector=query_embedding,
top_k=top_k
)
# Format results
formatted_results = []
for match in results.matches:
formatted_results.append({
"id": match.id,
"score": match.score,
})
return formatted_results
...
This is essentially a wrapper around the index.query method from Pinecone. We don't have to return all the results,
so the algorithm is optimized for a small value of top_k results.
There are a lot of cool optimizations that are done under the hood, stemming from a research area known as
Hierarchical Navigable Small World (HNSW) graphs.
We can now test it out at the bottom of our python file. We have to import from our embeddings.py file to get the embedding of a new image, which is not the
best structure, but will do for now (a better approach would be to pre-load the test embeddings). There is already a new image in the static directory called static/nemo.jpeg , but feel free to get an image of your own.
The CLIP model should pick up on the general concept of a fish in water, but should also match with more specific characteristics such as the color of the fish.
Running the file should give something similar to the following output:
python
...
if __name__ == "__main__":
# Set up embedder and database
from embeddings import ImageEmbedder
embedder = ImageEmbedder(model_name='clip-ViT-B-32', dimension=512)
db = FishVectorDB(index_name="fish-index", namespace="fish_species")
# Example: search for nemo
image_path = "static/nemo.jpeg"
embedding = embedder.get_image_embedding(image_path)
print("=== Search for similar fish ===")
results = db.search_similar(embedding, top_k=5)
print(f"Found {len(results)} similar fish:")
for result in results:
print(f" - {result['id']}: {result['score']:.4f}")
💡 Exact matches
If you try querying with an image already in the database, you may get an exact match with
a similarity score slightly above 1.0 ?! This is due to floating-point rounding in the cosine similarity calculation.
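A tiny illustration of why this happens: summing squared floating-point coordinates accumulates rounding error, so a unit vector's similarity with itself may not come out as exactly 1.0:

```python
import math

# A 3-d "unit" vector: its squared coordinates should sum to exactly 1,
# but floating-point rounding can push the total a hair off 1.0.
v = [1 / math.sqrt(3)] * 3
self_similarity = sum(x * x for x in v)

print(self_similarity)                     # extremely close to 1.0
print(math.isclose(self_similarity, 1.0))  # True
```

The same effect applies to 512-dimensional embeddings, just with more terms in the sum.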
Adding metadata to searches
The core functionality is now implemented for vector search, but we may also want to retrieve the other fields associated with the fish species, such as the description and region.
The metadata is already stored in the database, so we can just add the include_metadata parameter to the search_similar method.
Try testing it in the main block of the file by calling db.search_similar(embedding, top_k=5, include_metadata=True)
python
...
# Query using vector with metadata (the similarity metric
# is still the one fixed at index creation)
results = self.index.query(
namespace=self.namespace,
vector=query_embedding,
top_k=top_k,
include_metadata=include_metadata
)
# Format results
formatted_results = []
for match in results.matches:
formatted_results.append({
"id": match.id,
"score": match.score,
"metadata": match.metadata if include_metadata else {}
})
return formatted_results
Part D Exercises
Score metric
Expand the search_similar method to allow for a different metric such as euclidean or dot_product .
You can read more about how each one is implemented here .
Which one performs better than the others?
Bad bird🐦⬛
If we try to search for a bird image such as static/pelican.jpeg , our database seems to hallucinate that it is a fish with confidence.
Add a heuristic in search_similar to notify if the uploaded image is not a fish.
Admittedly, this task is tricky in the current setup because the dataset is so small.
Try to consider relative measures: if the top similarity score is not much higher than the second best, this can serve as a proxy for uncertainty.
Frontend with Gradio
Why Gradio?
Gradio is a library for quickly testing backend functions with intuitive frontend interfaces.
Although its toolkit is limited from a UI perspective, it is a powerful tool for quickly iterating on apps that depend on machine learning models.
Its main appeal is the quick learning curve and ease of implementation, so that developers and researchers can spend more time on the backend and less
time searching how to center a div. You can read about more specific reasons why Gradio is worth learning for ML-heavy apps here .
Launching a Gradio app
Some template code has already been provided in src/app.py . There are two main functions therein:
(i) create_ui which returns a demo object that is used to launch the app, and
(ii) recognize_fish which is the backend function that will be called when the user uploads an image of a fish.
To launch the app locally, we can run the following command:
There will be a local URL that you can open in your browser to test the app (e.g. http://127.0.0.1:7860 or http://localhost:7860).
You should be able to upload any image, but the returned text will not be implemented yet.
shell
$ python src/app.py
But what if you are working on a team and want to share your app with others?
Gradio provides functionality to share locally hosted apps with the share=True parameter.
You should receive a local and public URL (expires in 1 week) that others can access from a different device.
python
...
if __name__ == "__main__":
demo = create_ui()
demo.launch(share=True)
🚨 Shared Gradio apps
When navigating to the shared URL, you will notice that the startup time and general latency is much higher than the local version.
Calling the functions from the frontend
The Gradio components inside the gr.Blocks context just define the layout of all the
different UI interfaces, but on their own they are static and disconnected. To add dynamic functionality,
we add inputs and outputs fields to an action, such as whenever the image component changes:
This acts as a pipeline: whenever the image_input component changes, the recognize_fish function is called
with the new image as input and the output is displayed in the output component which is just basic markdown in this case.
python
...
image_input.change(
fn=recognize_fish,
inputs=image_input,
outputs=[output]
)
Next, we initialize our other classes, ImageEmbedder and FishVectorDB to be used in the recognize_fish function.
This initialization can be done in the global scope to avoid re-creating the objects on every call.
For any uploaded image, we store it temporarily locally so that we can use the image path that is compatible with our get_image_embedding function.
This will allow even copy-pasting images from the internet to work.
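A sketch of the temporary-file step, with stand-in bytes instead of a real upload (in the app itself, the PIL image handed over by Gradio would be saved with image.save(tmp_path)):

```python
import os
import tempfile

# Stand-in for the uploaded image's contents (illustrative only).
uploaded_bytes = b"fake jpeg bytes"

# Write the upload to a named temporary file so we get a real path
# that is compatible with get_image_embedding(image_path).
with tempfile.NamedTemporaryFile(suffix=".jpeg", delete=False) as tmp:
    tmp.write(uploaded_bytes)
    tmp_path = tmp.name

# ... here we would embed the image and query the database ...
print(tmp_path.endswith(".jpeg"))  # True

os.unlink(tmp_path)  # clean up the temporary file afterwards
```

Using delete=False and removing the file manually ensures the path stays valid for the embedder before cleanup.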
Since we are only interested in the top result, we can just access the first element of the results list and then also print out a formatted
markdown output. Python's f-string formatting is convenient here.
Try some of the example images at the bottom or upload your own to see if it works.
python
...
from embeddings import ImageEmbedder
from database import FishVectorDB
# Initialize components
embedder = ImageEmbedder(model_name='clip-ViT-B-32', dimension=512)
db = FishVectorDB(index_name="fish-index", namespace="fish_species")
...
python
...
# Generate embedding
embedding = embedder.get_image_embedding(tmp_path)
# Search for similar fish
results = db.search_similar(embedding, top_k=1, include_metadata=True)
# Clean up temporary file
os.unlink(tmp_path)
...
python
...
# Get top result
top_result = results[0]
metadata = top_result.get('metadata', {})
output = f"""# Top Match in FishVectorDB:
**ID:** {top_result.get('id', 'N/A')}
**Species:** {metadata.get('species', 'Unknown')}
**Similarity Score:** {top_result.get('score', 0):.4f}
**Description:** {metadata.get('description', 'No description available')}
**Region:** {metadata.get('region', 'Unknown')}
**Conservation Status:** {metadata.get('conservation_status', 'Unknown')}
"""
return output
...
Beautifying a metadata field
We will format the conservation status in a more readable way by adding a color to the text.
Gradio has a HighlightedText component that can be used for this purpose.
There is a get_conservation_status function in src/utils.py which will be used to do the color and text mapping for each corresponding status code.
No need to look too deeply into this file.
We will have to separate the conservation status from the rest of the metadata to be able to use the HighlightedText component.
Then our recognize_fish function will return two outputs: the markdown output and the highlighted conservation status.
Remember to also change the outputs field of the image_input.change action to now have [output, conservation_status] instead of just [output] .
You should see an output like the following:
python
from utils import get_conservation_status
...
with gr.Column():
output = gr.Markdown(label="Search Results")
conservation_status = gr.HighlightedText(
label="Conservation Status",
color_map=get_conservation_status(),
show_legend=True,
show_inline_category=True
)
...
python
...
status_code = metadata.get('conservation_status', 'N/A')
highlighted_status = get_conservation_status(status_code)
output = ...
return output, highlighted_status
Part E Exercises
Theming your Gradio app
Change the theme of the Gradio app to a different color scheme.
You can read more about the different themes here .
Formatting other metadata fields
Add at least one other formatted component for the other metadata fields.
You can be as creative as you want. Some possible ideas are:
- Highlighted Text component for the similarity score
- Metadata in a table format
- etc.
Displaying k results
Instead of just displaying the top result, display the top k results in a list format.
Add a slider component to the Gradio app to allow the user to select the number of results to display.
How the list of results is displayed is up to your taste.
ℹ️ Conclusion
This concludes the formal tutorial.
If you have managed to complete all the exercises, you should have picked up on a lot of best practices
for modern development whether you plan to do research engineering, startup building or even general software development.
If you have any feedback or suggestions, please reach out to the contact information at the bottom of the page.
Bonus: Context-aware chatbot
Add a chatbot interface to Gradio that allows the user to ask follow-up questions about the fish species detected.