A 512-bit Vision Transformer with a MobileNetV3 CNN for feature extraction.
This is not an official research project of Oregon State University.
This is a work in progress. I thought having AI would make it possible to complete in a day, but it is certainly slowing me down with horrible responses. Due to the lack of time, this project does not come with a testbench and has not been verified.
To set up the environment after completing the Caravel setup, run the following command in the root directory of the project:
```sh
source setup.sh
```
This is a simple, lightweight Vision Transformer for image classification. The model uses MobileNetV3 as the CNN for feature extraction and ViT as the Transformer Encoder.
```mermaid
graph LR
  subgraph Raspberry Pi Camera System
    RaspberryPi
  end
  subgraph Caravel
    subgraph ViT
      subgraph MobileNetV3
        input_image --> |CNN Feature Extraction| extracted_features
      end
      subgraph VisionTransformer[Vision Transformer]
        extracted_features --> |Patch Embeddings| patches
        patches --> |Positional Encoding| encoded_patches
        encoded_patches --> |Transformer Encoder| transformed_patches
        transformed_patches --> |Classification Head| output_classes
      end
      extracted_features --> |Skip Connection| transformed_patches
      output_classes --> |Classification| prediction
    end
  end
  RaspberryPi --> input_image
  prediction --> RaspberryPi
  style RaspberryPi fill:#66ff66
  style MobileNetV3 fill:#66ccff
  style VisionTransformer fill:#ff9966
  style ViT fill:#ffcc66
```
- Image Capture
- CNN Feature Extraction
- Patch Embeddings
- Positional Encoding
- Transformer Encoder
- Classification Head
- Classification
The Image Capture module captures one pixel at a time and stores it to memory over the Wishbone interconnect, which provides a 32-bit bus.
Correction: the Caravel clock is 40 MHz, not 100 MHz. That means 10 million pixels per second (one pixel every four clock cycles), which translates to 0.0262144 seconds per frame, or about 38.15 frames per second. Not ideal, and it leaves little budget for overhead. The GPIO maximum is 50 MHz, which gives about 47.68 frames per second, so still not enough. That means we can only stream images at 30 fps if returning labels, or 15 fps if returning an image.
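As a sanity check on these numbers, here is a minimal C sketch that reproduces the arithmetic, assuming a 512x512 frame and the one-pixel-every-four-cycles rate implied by the 40 MHz to 10 Mpixel/s figures above:

```c
#include <stdio.h>

int main(void) {
    const double pixels_per_frame = 512.0 * 512.0; /* 262,144 pixels */
    const double cycles_per_pixel = 4.0;           /* 40 MHz / 10 Mpixel/s */

    const double wb_clk   = 40e6;  /* Caravel clock */
    const double gpio_clk = 50e6;  /* GPIO maximum */

    /* seconds per frame = pixels * cycles per pixel / clock frequency */
    double s_per_frame = pixels_per_frame * cycles_per_pixel / wb_clk;
    printf("40 MHz: %.7f s/frame, %.2f fps\n", s_per_frame, 1.0 / s_per_frame);
    printf("50 MHz: %.2f fps\n", gpio_clk / (pixels_per_frame * cycles_per_pixel));
    return 0;
}
```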
This module supports the following configurations:
- 1-Channel Image (Grayscale)
- 3-Channel Image (RGB888 or YUV888)
- 4-Channel Image (RGB + Grayscale or even RGB + Alpha)
To use the module in a specific format, set the configuration register to the matching value below (a firmware sketch follows the table):
Channel | Register Value |
---|---|
1 | 32'hXXXX_XXX1 |
3 | 32'hXXXX_XXX3 |
4 | 32'hXXXX_XXX4 |
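For example, selecting the 3-channel mode from firmware might look like the sketch below. The register address is a placeholder (the real one comes from the project's Wishbone memory map), and it assumes the register can be read back for a read-modify-write; only the low-nibble encodings come from the table above.

```c
#include <stdint.h>

/* Hypothetical Wishbone address of the configuration register --
 * the real address comes from the Caravel user-project memory map. */
#define VIT_CFG_REG (*(volatile uint32_t *)0x30000000)

/* Channel-mode encodings from the table above (bits [3:0]). */
#define CH_MODE_GRAY 0x1u  /* 1 channel  */
#define CH_MODE_RGB  0x3u  /* 3 channels */
#define CH_MODE_RGBA 0x4u  /* 4 channels */

/* Set bits [3:0] only, leaving the remaining fields (the X nibbles) alone. */
static void set_channel_mode(uint32_t mode) {
    VIT_CFG_REG = (VIT_CFG_REG & ~0xFu) | (mode & 0xFu);
}
```

Calling `set_channel_mode(CH_MODE_RGB);` would then put the module in the 3-channel format.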
Wishbone Image Data Structure:
Bit Range | [31:24] | [23:16] | [15:8] | [7:0] |
---|---|---|---|---|
Channel Used | Channel 4 | Channel 3 | Channel 2 | Channel 1 |
Operation Mode | Mode 3 & 4 | Mode 3 & 4 | Mode 3 & 4 | Mode 1 & 4 |
Example Mode 1 | | | | Grayscale |
Example Mode 3 | R | G | B | |
Example Mode 4 | R | G | B | A |
NOTE: The channel in this section refers to the color channel of the image.
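Given that layout, one pixel word can be packed as in the sketch below. The helper names are illustrative; the byte lanes (channel 1 in [7:0] up to channel 4 in [31:24]) follow the table above.

```c
#include <stdint.h>

/* Pack one pixel into a 32-bit Wishbone word using the byte lanes above:
 * channel 4 -> [31:24], channel 3 -> [23:16],
 * channel 2 -> [15:8],  channel 1 -> [7:0]. */
static inline uint32_t pack_pixel(uint8_t ch4, uint8_t ch3, uint8_t ch2, uint8_t ch1) {
    return ((uint32_t)ch4 << 24) | ((uint32_t)ch3 << 16) |
           ((uint32_t)ch2 << 8)  |  (uint32_t)ch1;
}

/* Mode 1: grayscale rides in channel 1 ([7:0]); the other lanes are unused. */
static inline uint32_t pack_gray(uint8_t y) { return pack_pixel(0, 0, 0, y); }

/* Mode 3: R, G, B occupy channels 4..2; channel 1 is unused. */
static inline uint32_t pack_rgb(uint8_t r, uint8_t g, uint8_t b) {
    return pack_pixel(r, g, b, 0);
}

/* Mode 4: as Mode 3, with alpha (or a grayscale copy) in channel 1. */
static inline uint32_t pack_rgba(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    return pack_pixel(r, g, b, a);
}
```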
The CNN Feature Extraction uses a lightweight, user-configurable CNN model. The CNN model is configured using the configuration register, a 32-bit register with the following format:
Parameter | Register Bits | Description |
---|---|---|
Filter | [7:4] | See CNN Modes below |
Initial Depth | [11:8] | Number of First Layer Filters |
Filter Depth | [15:12] | Filter Number Multiples |
Conv Layers | [19:16] | Number of layers up to 16 |
Pooling Layers | [23:20] | Number of layers up to 16 |
Pooling Interval | [27:24] | Pooling Interval (Unchecked) |
This module supports the following CNN modes: P mode and V mode. The CNN mode is selected by setting the Filter field of the configuration register to one of the following values (a sketch composing the full configuration word follows the table):
Mode | Register Value |
---|---|
P | 32'hXXXX_XX3X |
V | 32'hXXXX_XX4X |
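Putting the pieces together, a firmware sketch for composing the whole configuration word might look like this. The address and the example field values are assumptions; the bit positions and mode encodings come from the tables above.

```c
#include <stdint.h>

/* Same hypothetical configuration register as in the channel-mode example. */
#define VIT_CFG_REG (*(volatile uint32_t *)0x30000000)

/* Filter-field encodings (bits [7:4]) from the mode table above. */
#define CNN_MODE_P 0x3u  /* 3x3 filter, padding 1 */
#define CNN_MODE_V 0x4u  /* 4x4 filter, no padding */

/* Compose the full configuration word from the fields documented above. */
static uint32_t vit_config(uint32_t ch_mode, uint32_t filter,
                           uint32_t init_depth, uint32_t filter_depth,
                           uint32_t conv_layers, uint32_t pool_layers,
                           uint32_t pool_interval) {
    return ((ch_mode       & 0xFu) << 0)  |  /* Channel mode     [3:0]   */
           ((filter        & 0xFu) << 4)  |  /* Filter           [7:4]   */
           ((init_depth    & 0xFu) << 8)  |  /* Initial Depth    [11:8]  */
           ((filter_depth  & 0xFu) << 12) |  /* Filter Depth     [15:12] */
           ((conv_layers   & 0xFu) << 16) |  /* Conv Layers      [19:16] */
           ((pool_layers   & 0xFu) << 20) |  /* Pooling Layers   [23:20] */
           ((pool_interval & 0xFu) << 24);   /* Pooling Interval [27:24] */
}

/* Example (values are illustrative): RGB input, P mode, 8 first-layer
 * filters, depth multiple 2, 4 conv layers, 2 pooling layers,
 * pooling every 2 conv layers. */
void vit_setup(void) {
    VIT_CFG_REG = vit_config(0x3u, CNN_MODE_P, 8, 2, 4, 2, 2);
}
```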
The P mode is the default mode for the CNN. It uses a padded image with a 3x3 filter and a stride of 1 to extract features.
Parameter | Value |
---|---|
Filter | 3x3 |
Stride | 1 |
Padding | 1 |
Image | 510x510 |
Output | 510x510 |
The V mode is an optional mode for the CNN. It uses an unpadded image with a 4x4 filter and a stride of 1 to extract features.
Parameter | Value |
---|---|
Filter | 4x4 |
Stride | 1 |
Padding | 0 |
Image | 512x512 |
Output | 509x509 |
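Both output sizes follow the standard convolution size formula, out = (in + 2 * pad - filter) / stride + 1, which this small check reproduces:

```c
#include <stdio.h>

/* Standard convolution output size: (in + 2*pad - filter) / stride + 1. */
static int conv_out(int in, int filter, int stride, int pad) {
    return (in + 2 * pad - filter) / stride + 1;
}

int main(void) {
    /* P mode: 510x510 input, 3x3 filter, stride 1, padding 1 -> 510 */
    printf("P mode: %d\n", conv_out(510, 3, 1, 1));
    /* V mode: 512x512 input, 4x4 filter, stride 1, padding 0 -> 509 */
    printf("V mode: %d\n", conv_out(512, 4, 1, 0));
    return 0;
}
```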
This project is created with the help of ChatGPT May 24 Version, as required by the AI Generated Open-Source Silicon Design Challenge. The prompts used to create this project are available at https://chat.openai.com/share/97b14e4b-678d-4793-92a2-292723c7b540. Minor modifications were made by a human to optimize and correct the design.
Copyright 2023 Anthony Kung kungc@oregonstate.edu
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
The project is created as an external image classification module to complement a Raspberry Pi camera system, serving as a fall detection system to identify and alert when my grandma falls. This project is also inspired by my research project on lightweight transformers at the Oregon State University System Technology and Application Research Lab under the supervision of Dr. Lizhong Chen, and is related to my class project on Vision Transformers for AI535 Deep Learning taught by Dr. Stephen Lee.