Stockfish-AI Onboarding Walkthrough
This page serves as a technical onboarding for members of the HUMIC community. Beyond some familiarity with Python, no prior knowledge of machine learning or AI is expected.
The following learning objectives will be covered:
- Version control and collaboration with GitHub
- Engineering workflows with Hugging Face models
- Storing and retrieving data from vector databases
- Quick iteration on the frontend with Gradio
🚨 Disclaimer
This tutorial is intended to be run on a Unix-based machine such as a Mac or Linux machine. Students on other machines can still follow along with some adjustments.
Setup
Getting started with Github
GitHub is a popular hosting platform for Git, the version control system that lets you track changes to your code.
Familiarity with Git commands will become increasingly important as you work on projects with larger teams.
Conceptually every repository will have a remote version (which you see on the website) and a local version (which you see in your IDE).
Every developer will then push their changes from local -> remote and pull a collaborator's changes from remote -> local.
After signing up, you can fork the starter template available at https://github.com/HUMIC-CLUB/Stockfish-template .
If you kept the default name it should be at https://github.com/<YourUsername>/Stockfish-template .
To create a local version from the remote version, you can clone the repository to your local machine using the following command:
shell
$ git clone https://github.com/<YourUsername>/Stockfish-template.git
Python virtual environments
When installing Python libraries, it is recommended to use a virtual environment to avoid conflicts with other projects.
Pip is the most popular package manager for Python, and it is the one we will be using.
To create a virtual environment, in the folder of the project (remember to use cd to navigate to the folder), run the following command:
shell
$ python -m venv .myenv
This will create a virtual environment in the folder .myenv . Feel free to use a different name.
To activate the virtual environment, run the following command:
shell
$ source .myenv/bin/activate
You can deactivate the virtual environment by running deactivate .
There will be quite a few dependencies to install. We have added all the necessary dependencies in the requirements.txt file.
To install them, run the following command while the virtual environment is activated:
shell
$ pip install -r requirements.txt
💡 Digging through pip dependencies
You can see which packages have been installed by running the following command:
shell
$ pip list
Sometimes this list gets pretty long, so to check whether a specific package like "transformers" has been installed, you can run the following command:
shell
$ pip list | grep transformers
The packages will be in the .myenv folder. We will see in the next part that this is only for your copy of the repo so it should not be pushed to the remote repository.
Image Embedding Model
What are vector embeddings?
Vector embeddings are a way to represent rich data like images, text, audio, etc. in a vector space.
Every possible object is represented as an array of numbers with fixed length, known as the embedding dimension .
For further reading, you can check out this article or look at our fellowship resources here .
We want a model to convert an image of a fish into a vector. Objects that are close to each other in the vector space are semantically similar (e.g. images of the same fish species should be closer to each other geometrically).
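To make this concrete, here is a toy sketch with made-up 3-dimensional vectors (real embedding models output hundreds of dimensions) showing how geometric distance captures similarity:

```python
import math

# Toy "embeddings" with a fixed dimension of 3. Real models use far
# more dimensions (e.g. 512 for the CLIP model later in this tutorial),
# and all values here are made up purely for illustration.
embeddings = {
    "clownfish": [0.9, 0.1, 0.2],
    "goldfish":  [0.8, 0.2, 0.3],
    "pelican":   [0.1, 0.9, 0.7],
}

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The two fish vectors are geometrically close; the bird is far from both.
d_fish = euclidean_distance(embeddings["clownfish"], embeddings["goldfish"])
d_bird = euclidean_distance(embeddings["clownfish"], embeddings["pelican"])
print(f"fish-to-fish: {d_fish:.3f}, fish-to-bird: {d_bird:.3f}")
```

A real embedding model does exactly this, just with learned vectors instead of hand-picked ones.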
Hugging Face Models
Hugging Face is a popular platform for downloading open-weight machine learning models.
They also provide Python libraries to download and use the models such as Sentence Transformers
which we will use for our Image Embedding Model.
python
class ImageEmbedder:
"""Handles image embedding generation using CLIP model."""
def __init__(self, model_name: str, dimension: int):
"""
Initialize the CLIP model for image embeddings.
Args:
model_name: Name of the sentence-transformers CLIP model
dimension: Dimension of the embedding
"""
self.model = SentenceTransformer(model_name)
self.dimension = dimension
In src/embeddings.py , we will implement the ImageEmbedder class.
Note the use of Python type hints in the class constructor, which help with type checking and documentation.
CLIP Model
There are many pre-trained models available on Hugging Face. We will be using the sentence-transformers/clip-ViT-B-32 model which is a Vision Transformer
that was trained jointly on text and images, with a paired dataset of captioned images (read more here ).
This model gives embeddings of dimension 512 .
This Python file will most likely be imported to another file, but we can still test it in isolation. A quick way to add a small unit test is to add a main block that is run when the file is executed directly.
If we run the file directly, we should see the model weights being downloaded locally with the safetensor format.
python
if __name__ == "__main__":
# Initialize embedder
embedder = ImageEmbedder(model_name='clip-ViT-B-32', dimension=512)
shell
$ python src/embeddings.py
ℹ️ Caching weights
You may ask if the model will be downloaded every time the file is executed.
The answer is only the first time for that model. Subsequent runs will use the cached weights.
Check out "~/.cache/huggingface/hub" for all your models and their weights.
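A small helper can peek into that cache folder (the function name here is our own, and HF_HOME is the environment variable that relocates the cache if set):

```python
from pathlib import Path

def list_cached_models(cache_dir: Path) -> list:
    """Return the names of model folders in a local Hugging Face cache.

    Returns an empty list when nothing has been downloaded yet.
    """
    if not cache_dir.exists():
        return []
    return sorted(p.name for p in cache_dir.iterdir() if p.is_dir())

# Default cache location used by the Hugging Face hub libraries
# (it can be moved with the HF_HOME environment variable).
hub_cache = Path.home() / ".cache" / "huggingface" / "hub"
print(list_cached_models(hub_cache))
```

After running src/embeddings.py once, you should see a folder for the CLIP model appear in this listing.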
Generating Embeddings
We can now implement the get_image_embedding function.
For convenience when processing images we will use the Pillow library
Notice the return type is a list of floats given by List[float] .
It requires a local image path as input. We already have fishes stored in the static directory such as static/clownfish.jpeg . We can now test it out at the bottom of our python file and we should see an array with 512 floats:
python
from typing import List
from PIL import Image
...
def get_image_embedding(self, image_path: str) -> List[float]:
# Load image
image = Image.open(image_path)
# Generate embedding (returns numpy array)
embedding = self.model.encode(image, convert_to_numpy=True)
# Convert to list of floats
return embedding.tolist()
python
...
if __name__ == "__main__":
# Initialize embedder
embedder = ImageEmbedder(model_name='clip-ViT-B-32', dimension=512)
image_path = "static/clownfish.jpeg"
embedding = embedder.get_image_embedding(image_path)
print(f"Embedding vector (dimension: {len(embedding)}):")
print(embedding)
Pushing to Github
Before pushing it is useful to check the status of your local version such as which files can be committed with git status .
Note that our Python virtual environment would be committed which is not ideal.
To exclude the Python virtual environment from being committed, we can add it to the .gitignore file.
This file will be used again when we want to prevent pushing secrets such as api keys.
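For example, a minimal .gitignore for this project might contain (the folder name matches the virtual environment created earlier):

```gitignore
# Python virtual environment
.myenv/
# Environment variables and secrets (used later for API keys)
.env
# Python bytecode caches
__pycache__/
```

Running git status again afterwards should no longer list these paths.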
Now our code is ready to be pushed. First, we need to add the changes to the staging area using the following command:
Then, we can commit the changes. The -m flag allows you to tag a descriptive message to the commit.
Finally, we can push the changes to the remote repository, which should show up on the Github website.
If you get an error, make sure you are on your own forked repository and not the original template repository.
shell
$ git add .
shell
$ git commit -m "Feat: implemented core image embedder functionality"
shell
$ git push
ℹ️ Git Branches
We will be mostly using the main branch for our project.
When many people are touching the same code, you will often push to a separate branch instead of the main branch so that merge conflicts are less disruptive.
Read here for more information on Git Branches .
Part B Exercises
Batch Embeddings
Implement a new class method get_batch_embeddings which takes in a list of image paths and encodes all of them at once.
Return the list of embedding vectors. Try to only call the "encode" function once.
Scrappy Testing
Test the method(s) from the previous exercise by adding two more images of fish from the internet to the static folder
and passing in a list of image file paths to the method at the bottom of the Python file.
Push your changes to the Github remote repository.
Vector Databases with Pinecone
Pinecone setup
A vector embedding is not that interesting by itself. But once combined with many data points, the distance between every two entries in the dataset becomes meaningful.
A vector database is a way to store and query these embeddings. Many general database services support vector search such as Supabase and MongoDB .
We will be taking a look at Pinecone which is more specialized in this domain. They have a generous starter tier just by signing up.
After signing up, we can create a new index, called fish-index . An index is a collection of vectors that can be queried. You can use any configuration, so long as the dimension matches our CLIP model's output dimension of 512 .
Access to the database is restricted by a secret credential known as the API Key . For projects where the source code will be published, such as on Github,
it is important to make sure the API key is ignored by the version control system. The .env paradigm is almost universal now for storing environment variables.
Make sure to add the .env file to the .gitignore file before pushing to Github. After copying an api key from Pinecone, we can create a new .env file in the root directory and add the following line:
.env
PINECONE_API_KEY=your-api-key
Connecting to the database
We can see from the template code in src/database.py that we will be using the Pinecone library to interact with the database.
To gain access to the specific slice of data we need, we have to specify the index name fish-index and then a namespace .
To avoid hardcoding the API key, we load the env file with the python-dotenv library.
The rest of the constructor is just extracting the reference to the index from pc.Index(index_name)
python
from pinecone import Pinecone
...
class FishVectorDB:
"""Handles Pinecone vector database operations for fish embeddings."""
def __init__(
self,
index_name: str,
namespace: str
):
"""
Initialize Pinecone client and connect to index.
Args:
index_name: Name of the Pinecone index
namespace: Namespace for data isolation (mandatory)
"""
# Initialize Pinecone client
pc = Pinecone(api_key="FILL IN YOUR API KEY HERE")
self.index_name = index_name
self.namespace = namespace
python
import os
from dotenv import load_dotenv
...
class FishVectorDB:
"""Handles Pinecone vector database operations for fish embeddings."""
def __init__(
self,
index_name: str,
namespace: str
):
# Load environment variables from .env file
load_dotenv()
# Initialize Pinecone client
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
...
Seeding the database
The upsert operation allows one to either insert a new vector into the database or update an existing one, depending on whether the id already exists. Every vector that represents our image will have accompanying metadata as seen in static/fish_species.csv .
csv
id,species,description,region,conservation_status
clownfish,Clownfish,"Bright orange fish with white stripes, commonly found in coral reefs.",Indo-Pacific,LC
goldfish,Goldfish,"Small orange or gold-colored freshwater fish, popular in aquariums.",Worldwide (domestic),LC
...
🚨 Dataset disclaimer
The fish dataset is not an official one and was scraped from the internet from various sources. Apologies for any inaccuracies.
The csv parsing logic is already implemented in src/seed.py .
We can then use the upsert_batch method from our FishVectorDB class to upsert the vectors to the database after the csv file has been parsed.
Now all that's left is to run the seeding function with python src/seed.py . Calling the function in the main block should be enough.
By going to the Pinecone console in the browser, we can see the metadata has been upserted including the vector representation of each image in the static directory.
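For reference, the heart of an upsert_batch method is converting records into Pinecone's vector format: a list of dicts with id, values, and metadata keys. Here is a minimal sketch (the helper name and record shape are illustrative, not necessarily the template's exact implementation):

```python
from typing import Any, Dict, List

def to_pinecone_vectors(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert parsed fish records into Pinecone's upsert format."""
    vectors = []
    for rec in records:
        vectors.append({
            "id": rec["id"],
            "values": rec["embedding"],  # the embedding as a list of floats
            "metadata": {k: v for k, v in rec.items()
                         if k not in ("id", "embedding")},
        })
    return vectors

# A record shaped like a parsed CSV row plus its embedding
# (a dummy 3-d vector here instead of the real 512-d one):
records = [{"id": "clownfish", "embedding": [0.1, 0.2, 0.3],
            "species": "Clownfish", "region": "Indo-Pacific"}]
vectors = to_pinecone_vectors(records)
# Inside FishVectorDB.upsert_batch one would then call something like:
#   self.index.upsert(vectors=vectors, namespace=self.namespace)
print(vectors[0]["id"], sorted(vectors[0]["metadata"]))
```

Keeping the metadata alongside the vector is what lets us return human-readable results later instead of just ids and scores.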
python
from embeddings import ImageEmbedder
from database import FishVectorDB
...
def seed_from_csv(
csv_path: str,
embedder: ImageEmbedder,
db: FishVectorDB
) -> None:
"""
Seed Pinecone database from CSV file.
Args:
csv_path: Path to CSV file with columns: id, species, description, region, conservation_status, image_path
embedder: ImageEmbedder instance for generating embeddings
db: FishVectorDB instance
"""
...
python
...
# Upsert in batch
if fish_records:
print(f"Upserting {len(fish_records)} fish records...")
db.upsert_batch(fish_records)
print("Seeding completed!")
else:
print("No records to upsert.")
...
python
...
if __name__ == "__main__":
embedder = ImageEmbedder(model_name='clip-ViT-B-32', dimension=512)
db = FishVectorDB(index_name="fish-index", namespace="fish_species")
print("=== Seeding database from CSV ===")
seed_from_csv("static/fish_species.csv", embedder, db)
💡 Resetting the index
If the seed function only upserted partially, or it was accidentally executed twice, you can reset the database with:
python
db.clear_index()
Part C Exercises
Batched upserts
For larger datasets, we may quickly hit rate limits from Pinecone in our upsert_batch method.
Run a loop to upsert the vectors in batches of maximum size batch_size .
Database stats
Implement the get_stats method in the FishVectorDB class which returns useful information about the database such as the number of vectors and the dimension of each one.
The describe_index_stats method may be useful here.
Querying a Vector DB
Cosine similarity search
This section will be a bit shorter but more technical.
We need a way to find the closest fish species vectors to an embedding vector of an unseen image. Cosine similarity is a popular
way of measuring this which gives a score for each vector in the database between -1 and 1 with a higher score indicating more similarity.
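Conceptually, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python version, for intuition only (Pinecone computes this server-side):

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (|a| * |b|): 1 = same direction, 0 = orthogonal, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # identical direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal -> 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite -> -1.0
```

Note that only the direction of the vectors matters, not their magnitude, which is why it works well for comparing embeddings.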
We will add a new function search_similar back in our FishVectorDB class to return the top k results.
python
...
def search_similar(
self,
query_embedding: List[float],
top_k: int
) -> List[Dict[str, Any]]:
"""
Search for similar fish based on embedding similarity.
Args:
query_embedding: 512-d embedding vector to search for
top_k: Number of results to return
Returns:
List of search results with id, score, and metadata
"""
# Query using vector (the similarity metric, e.g. cosine,
# is fixed when the index is created, not per query)
results = self.index.query(
namespace=self.namespace,
vector=query_embedding,
top_k=top_k
)
# Format results
formatted_results = []
for match in results.matches:
formatted_results.append({
"id": match.id,
"score": match.score,
})
return formatted_results
...
This is essentially a wrapper around the index.query method from Pinecone. We don't have to return all the results,
so the algorithm is optimized for a small value of top_k results.
There are a lot of cool optimizations that are done under the hood, stemming from a research area known as
Hierarchical Navigable Small World (HNSW) graphs.
We can now test it out at the bottom of our python file. We have to import from our embeddings.py file to get the embedding of a new image, which is not the
best structure, but will do for now (a better approach would be to pre-load the test embeddings). There is already a new image in the static directory called static/nemo.jpeg , but feel free to get an image of your own.
The CLIP model should pick up on the general concept of a fish in water, but should also match with more specific characteristics such as the color of the fish.
Running the file should give something similar to the following output:
python
...
if __name__ == "__main__":
# Set up embedder and database
from embeddings import ImageEmbedder
embedder = ImageEmbedder(model_name='clip-ViT-B-32', dimension=512)
db = FishVectorDB(index_name="fish-index", namespace="fish_species")
# Example: search for nemo
image_path = "static/nemo.jpeg"
embedding = embedder.get_image_embedding(image_path)
print("=== Search for similar fish ===")
results = db.search_similar(embedding, top_k=5)
print(f"Found {len(results)} similar fish:")
for result in results:
print(f" - {result['id']}: {result['score']:.4f}")
💡 Exact matches
If you try querying with an image already in the database, you may get an exact match with
a similarity score slightly above 1.0 ?! This is due to floating-point rounding in the cosine similarity calculation.
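A tiny illustration of why this happens: summing squared floating-point coordinates accumulates rounding error, so a unit vector's similarity with itself may not come out as exactly 1.0:

```python
import math

# A 3-d "unit" vector: its squared coordinates should sum to exactly 1,
# but floating-point rounding can push the total a hair off 1.0.
v = [1 / math.sqrt(3)] * 3
self_similarity = sum(x * x for x in v)

print(self_similarity)                     # extremely close to 1.0
print(math.isclose(self_similarity, 1.0))  # True
```

The same effect applies to 512-dimensional embeddings, just with more terms in the sum.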
Adding metadata to searches
The core functionality is now implemented for vector search, but we may also want to retrieve the other fields associated with the fish species, such as the description and region.
The metadata is already stored in the database, so we can just add the include_metadata parameter to the search_similar method.
Try testing it in the main block of the file by calling db.search_similar(embedding, top_k=5, include_metadata=True)
python
...
# Query using vector with metadata (the similarity metric
# is still the one fixed at index creation)
results = self.index.query(
namespace=self.namespace,
vector=query_embedding,
top_k=top_k,
include_metadata=include_metadata
)
# Format results
formatted_results = []
for match in results.matches:
formatted_results.append({
"id": match.id,
"score": match.score,
"metadata": match.metadata if include_metadata else {}
})
return formatted_results
Part D Exercises
Score metric
Expand the search_similar method to allow for a different metric such as euclidean or dot_product .
You can read more about how each one is implemented here .
Which one performs better than the others?
Bad bird🐦⬛
If we try to search for a bird image such as static/pelican.jpeg , our database seems to hallucinate that it is a fish with confidence.
Add a heuristic in search_similar to notify if the uploaded image is not a fish.
Admittedly, this task is tricky in the current setup because the dataset is so small.
Try to consider relative measures: if the top similarity score is not much higher than the second best, this can serve as a proxy for uncertainty.
Frontend with Gradio
Why Gradio?
Gradio is a library for quickly testing backend functions with intuitive frontend interfaces.
Although its toolkit is limited from a UI perspective, it is a powerful tool for quickly iterating on apps that depend on machine learning models.
Its main appeal is the quick learning curve and ease of implementation, so that developers and researchers can spend more time on the backend and less
time searching how to center a div. You can read about more specific reasons why Gradio is worth learning for ML-heavy apps here .
Launching a Gradio app
Some template code has already been provided in src/app.py . There are two main functions therein:
(i) create_ui which returns a demo object that is used to launch the app, and
(ii) recognize_fish which is the backend function that will be called when the user uploads an image of a fish.
To launch the app locally, we can run the following command:
There will be a local URL that you can open in your browser to test the app (e.g. http://127.0.0.1:7860 or http://localhost:7860).
You should be able to upload any image, but the returned text will not be implemented yet.
shell
$ python src/app.py
But what if you are working on a team and want to share your app with others?
Gradio provides functionality to share locally hosted apps with the share=True parameter.
You should receive a local and public URL (expires in 1 week) that others can access from a different device.
python
...
if __name__ == "__main__":
demo = create_ui()
demo.launch(share=True)
🚨 Shared Gradio apps
When navigating to the shared URL, you will notice that the startup time and general latency is much higher than the local version.
Calling the functions from the frontend
The Gradio components inside the gr.Blocks context just define the layout of all the
different UI interfaces, but on their own they are static and disconnected. To add dynamic functionality,
we add inputs and outputs fields to an action, such as whenever the image component changes:
This acts as a pipeline: whenever the image_input component changes, the recognize_fish function is called
with the new image as input and the output is displayed in the output component which is just basic markdown in this case.
python
...
image_input.change(
fn=recognize_fish,
inputs=image_input,
outputs=[output]
)
Next, we initialize our other classes, ImageEmbedder and FishVectorDB to be used in the recognize_fish function.
This initialization can be done in the global scope to avoid re-creating the objects on every call.
For any uploaded image, we store it temporarily locally so that we can use the image path that is compatible with our get_image_embedding function.
This will allow even copy-pasting images from the internet to work.
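A sketch of the temporary-file step, with stand-in bytes instead of a real upload (in the app itself, the PIL image handed over by Gradio would be saved with image.save(tmp_path)):

```python
import os
import tempfile

# Stand-in for the uploaded image's contents (illustrative only).
uploaded_bytes = b"fake jpeg bytes"

# Write the upload to a named temporary file so we get a real path
# that is compatible with get_image_embedding(image_path).
with tempfile.NamedTemporaryFile(suffix=".jpeg", delete=False) as tmp:
    tmp.write(uploaded_bytes)
    tmp_path = tmp.name

# ... here we would embed the image and query the database ...
print(tmp_path.endswith(".jpeg"))  # True

os.unlink(tmp_path)  # clean up the temporary file afterwards
```

Using delete=False and removing the file manually ensures the path stays valid for the embedder before cleanup.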
Since we are only interested in the top result, we can just access the first element of the results list and then also print out a formatted
markdown output. Python's f-string formatting is convenient here.
Try some of the example images at the bottom or upload your own to see if it works.
python
...
from embeddings import ImageEmbedder
from database import FishVectorDB
# Initialize components
embedder = ImageEmbedder(model_name='clip-ViT-B-32', dimension=512)
db = FishVectorDB(index_name="fish-index", namespace="fish_species")
...
python
...
# Generate embedding
embedding = embedder.get_image_embedding(tmp_path)
# Search for similar fish
results = db.search_similar(embedding, top_k=1, include_metadata=True)
# Clean up temporary file
os.unlink(tmp_path)
...
python
...
# Get top result
top_result = results[0]
metadata = top_result.get('metadata', {})
output = f"""# Top Match in FishVectorDB:
**ID:** {top_result.get('id', 'N/A')}
**Species:** {metadata.get('species', 'Unknown')}
**Similarity Score:** {top_result.get('score', 0):.4f}
**Description:** {metadata.get('description', 'No description available')}
**Region:** {metadata.get('region', 'Unknown')}
**Conservation Status:** {metadata.get('conservation_status', 'Unknown')}
"""
return output
...
Beautifying a metadata field
We will format the conservation status in a more readable way by adding a color to the text.
Gradio has a HighlightedText component that can be used for this purpose.
There is a get_conservation_status function in src/utils.py which will be used to do the color and text mapping for each corresponding status code.
No need to look too deeply into this file.
We will have to separate the conservation status from the rest of the metadata to be able to use the HighlightedText component.
Then our recognize_fish function will return two outputs: the markdown output and the highlighted conservation status.
Remember to also change the outputs field of the image_input.change action to now have [output, conservation_status] instead of just [output] .
You should see an output like the following:
python
from utils import get_conservation_status
...
with gr.Column():
output = gr.Markdown(label="Search Results")
conservation_status = gr.HighlightedText(
label="Conservation Status",
color_map=get_conservation_status(),
show_legend=True,
show_inline_category=True
)
...
python
...
status_code = metadata.get('conservation_status', 'N/A')
highlighted_status = get_conservation_status(status_code)
output = ...
return output, highlighted_status
Part E Exercises
Theming your Gradio app
Change the theme of the Gradio app to a different color scheme.
You can read more about the different themes here .
Formatting other metadata fields
Add at least one other formatted component for the other metadata fields.
You can be as creative as you want. Some possible ideas are:
- Highlighted Text component for the similarity score
- Metadata in a table format
- etc.
Displaying k results
Instead of just displaying the top result, display the top k results in a list format.
Add a slider component to the Gradio app to allow the user to select the number of results to display.
How the list of results is displayed is up to your taste.
ℹ️ Conclusion
This concludes the formal tutorial.
If you have managed to complete all the exercises, you should have picked up on a lot of best practices
for modern development whether you plan to do research engineering, startup building or even general software development.
If you have any feedback or suggestions, please reach out to the contact information at the bottom of the page.
Bonus: Context-aware chatbot
Add a chatbot interface to Gradio that allows the user to ask follow-up questions about the fish species detected.