A 512-bit Vision Transformer with a MobileNetV3 CNN for feature extraction.
This is not an official research project of Oregon State University.
This is a work in progress. I thought having AI would make it possible to complete in a day, but it is certainly slowing me down with horrible responses. Due to the lack of time, this project does not come with a testbench and has not been verified.
To set up the environment after completing the Caravel setup, run the following command in the root directory of the project:
```sh
source setup.sh
```
This is a simple, lightweight Vision Transformer for image classification. The model uses MobileNetV3 as the CNN for feature extraction and ViT as the Transformer Encoder.
```mermaid
graph LR
  subgraph Raspberry Pi Camera System
    RaspberryPi
  end
  subgraph Caravel
    subgraph ViT
      subgraph MobileNetV3
        input_image --> |CNN Feature Extraction| extracted_features
      end
      subgraph VisionTransformer[Vision Transformer]
        extracted_features --> |Patch Embeddings| patches
        patches --> |Positional Encoding| encoded_patches
        encoded_patches --> |Transformer Encoder| transformed_patches
        transformed_patches --> |Classification Head| output_classes
      end
      extracted_features --> |Skip Connection| transformed_patches
      output_classes --> |Classification| prediction
    end
  end
  RaspberryPi --> input_image
  prediction --> RaspberryPi
  style RaspberryPi fill:#66ff66
  style MobileNetV3 fill:#66ccff
  style VisionTransformer fill:#ff9966
  style ViT fill:#ffcc66
```
- Image Capture
- CNN Feature Extraction
- Patch Embeddings
- Positional Encoding
- Transformer Encoder
- Classification Head
- Classification
The Image Capture module captures one pixel at a time and stores it to memory over the Wishbone interconnect, which provides a 32-bit bus.
Correction: the Caravel clock is 40 MHz, not 100 MHz. That means 10 million pixels per second (one pixel every four clock cycles), which translates to 0.0262144 seconds per frame, or about 38.15 frames per second. Not ideal, and it leaves little budget for overhead. The GPIO maximum is 50 MHz, which gives about 47.68 frames per second, so still not enough. That means we can only stream images at 30 fps if returning labels, or 15 fps if returning an image.
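As a sanity check on these numbers, here is a minimal C sketch that reproduces the arithmetic, assuming a 512x512 frame and the one-pixel-every-four-cycles rate implied by the 40 MHz to 10 Mpixel/s figures above:

```c
#include <stdio.h>

int main(void) {
    const double pixels_per_frame = 512.0 * 512.0; /* 262,144 pixels */
    const double cycles_per_pixel = 4.0;           /* 40 MHz / 10 Mpixel/s */

    const double wb_clk   = 40e6;  /* Caravel clock */
    const double gpio_clk = 50e6;  /* GPIO maximum */

    /* seconds per frame = pixels * cycles per pixel / clock frequency */
    double s_per_frame = pixels_per_frame * cycles_per_pixel / wb_clk;
    printf("40 MHz: %.7f s/frame, %.2f fps\n", s_per_frame, 1.0 / s_per_frame);
    printf("50 MHz: %.2f fps\n", gpio_clk / (pixels_per_frame * cycles_per_pixel));
    return 0;
}
```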
This module supports the following configurations:
- 1-Channel Image (Grayscale)
- 3-Channel Image (RGB888 or YUV888)
- 4-Channel Image (RGB + Grayscale or even RGB + Alpha)
To use the module in a specific format, set the configuration register to the matching value below (a firmware sketch follows the table):
Channel | Register Value |
---|---|
1 | 32'hXXXX_XXX1 |
3 | 32'hXXXX_XXX3 |
4 | 32'hXXXX_XXX4 |
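For example, selecting the 3-channel mode from firmware might look like the sketch below. The register address is a placeholder (the real one comes from the project's Wishbone memory map), and it assumes the register can be read back for a read-modify-write; only the low-nibble encodings come from the table above.

```c
#include <stdint.h>

/* Hypothetical Wishbone address of the configuration register --
 * the real address comes from the Caravel user-project memory map. */
#define VIT_CFG_REG (*(volatile uint32_t *)0x30000000)

/* Channel-mode encodings from the table above (bits [3:0]). */
#define CH_MODE_GRAY 0x1u  /* 1 channel  */
#define CH_MODE_RGB  0x3u  /* 3 channels */
#define CH_MODE_RGBA 0x4u  /* 4 channels */

/* Set bits [3:0] only, leaving the remaining fields (the X nibbles) alone. */
static void set_channel_mode(uint32_t mode) {
    VIT_CFG_REG = (VIT_CFG_REG & ~0xFu) | (mode & 0xFu);
}
```

Calling `set_channel_mode(CH_MODE_RGB);` would then put the module in the 3-channel format.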
Wishbone Image Data Structure:
Bit Range | [31:24] | [23:16] | [15:8] | [7:0] |
---|---|---|---|---|
Channel Used | Channel 4 | Channel 3 | Channel 2 | Channel 1 |
Operation Mode | Mode 3 & 4 | Mode 3 & 4 | Mode 3 & 4 | Mode 1 & 4 |
Example Mode 1 | | | | Grayscale |
Example Mode 3 | R | G | B | |
Example Mode 4 | R | G | B | A |
NOTE: The channel in this section refers to the color channel of the image.
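Given that layout, one pixel word can be packed as in the sketch below. The helper names are illustrative; the byte lanes (channel 1 in [7:0] up to channel 4 in [31:24]) follow the table above.

```c
#include <stdint.h>

/* Pack one pixel into a 32-bit Wishbone word using the byte lanes above:
 * channel 4 -> [31:24], channel 3 -> [23:16],
 * channel 2 -> [15:8],  channel 1 -> [7:0]. */
static inline uint32_t pack_pixel(uint8_t ch4, uint8_t ch3, uint8_t ch2, uint8_t ch1) {
    return ((uint32_t)ch4 << 24) | ((uint32_t)ch3 << 16) |
           ((uint32_t)ch2 << 8)  |  (uint32_t)ch1;
}

/* Mode 1: grayscale rides in channel 1 ([7:0]); the other lanes are unused. */
static inline uint32_t pack_gray(uint8_t y) { return pack_pixel(0, 0, 0, y); }

/* Mode 3: R, G, B occupy channels 4..2; channel 1 is unused. */
static inline uint32_t pack_rgb(uint8_t r, uint8_t g, uint8_t b) {
    return pack_pixel(r, g, b, 0);
}

/* Mode 4: as Mode 3, with alpha (or a grayscale copy) in channel 1. */
static inline uint32_t pack_rgba(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    return pack_pixel(r, g, b, a);
}
```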
The CNN Feature Extraction uses a lightweight, user-configurable CNN model. The CNN model is configured using the configuration register, a 32-bit register with the following format:
Parameter | Register Bits | Description |
---|---|---|
Filter | [7:4] | See CNN Modes below |
Initial Depth | [11:8] | Number of First Layer Filters |
Filter Depth | [15:12] | Filter Number Multiples |
Conv Layers | [19:16] | Number of layers up to 16 |
Pooling Layers | [23:20] | Number of layers up to 16 |
Pooling Interval | [27:24] | Pooling Interval (Unchecked) |
This module supports the following CNN modes: P mode and V mode. The CNN mode is selected by setting the Filter field of the configuration register to one of the following values (a sketch composing the full configuration word follows the table):
Mode | Register Value |
---|---|
P | 32'hXXXX_XX3X |
V | 32'hXXXX_XX4X |
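Putting the pieces together, a firmware sketch for composing the whole configuration word might look like this. The address and the example field values are assumptions; the bit positions and mode encodings come from the tables above.

```c
#include <stdint.h>

/* Same hypothetical configuration register as in the channel-mode example. */
#define VIT_CFG_REG (*(volatile uint32_t *)0x30000000)

/* Filter-field encodings (bits [7:4]) from the mode table above. */
#define CNN_MODE_P 0x3u  /* 3x3 filter, padding 1 */
#define CNN_MODE_V 0x4u  /* 4x4 filter, no padding */

/* Compose the full configuration word from the fields documented above. */
static uint32_t vit_config(uint32_t ch_mode, uint32_t filter,
                           uint32_t init_depth, uint32_t filter_depth,
                           uint32_t conv_layers, uint32_t pool_layers,
                           uint32_t pool_interval) {
    return ((ch_mode       & 0xFu) << 0)  |  /* Channel mode     [3:0]   */
           ((filter        & 0xFu) << 4)  |  /* Filter           [7:4]   */
           ((init_depth    & 0xFu) << 8)  |  /* Initial Depth    [11:8]  */
           ((filter_depth  & 0xFu) << 12) |  /* Filter Depth     [15:12] */
           ((conv_layers   & 0xFu) << 16) |  /* Conv Layers      [19:16] */
           ((pool_layers   & 0xFu) << 20) |  /* Pooling Layers   [23:20] */
           ((pool_interval & 0xFu) << 24);   /* Pooling Interval [27:24] */
}

/* Example (values are illustrative): RGB input, P mode, 8 first-layer
 * filters, depth multiple 2, 4 conv layers, 2 pooling layers,
 * pooling every 2 conv layers. */
void vit_setup(void) {
    VIT_CFG_REG = vit_config(0x3u, CNN_MODE_P, 8, 2, 4, 2, 2);
}
```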
The P mode is the default mode for the CNN. It uses a padded image with a 3x3 filter and a stride of 1 to extract features.
Parameter | Value |
---|---|
Filter | 3x3 |
Stride | 1 |
Padding | 1 |
Image | 510x510 |
Output | 510x510 |
The V mode is an optional mode for the CNN. It uses an unpadded image with a 4x4 filter and a stride of 1 to extract features.
Parameter | Value |
---|---|
Filter | 4x4 |
Stride | 1 |
Padding | 0 |
Image | 512x512 |
Output | 509x509 |
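Both output sizes follow the standard convolution size formula, out = (in + 2 * pad - filter) / stride + 1, which this small check reproduces:

```c
#include <stdio.h>

/* Standard convolution output size: (in + 2*pad - filter) / stride + 1. */
static int conv_out(int in, int filter, int stride, int pad) {
    return (in + 2 * pad - filter) / stride + 1;
}

int main(void) {
    /* P mode: 510x510 input, 3x3 filter, stride 1, padding 1 -> 510 */
    printf("P mode: %d\n", conv_out(510, 3, 1, 1));
    /* V mode: 512x512 input, 4x4 filter, stride 1, padding 0 -> 509 */
    printf("V mode: %d\n", conv_out(512, 4, 1, 0));
    return 0;
}
```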
This project is created with the help of ChatGPT May 24 Version, as required by the AI Generated Open-Source Silicon Design Challenge. The prompts used to create this project are available at https://chat.openai.com/share/97b14e4b-678d-4793-92a2-292723c7b540. Minor modifications were made by a human to optimize and correct the design.
Copyright 2023 Anthony Kung kungc@oregonstate.edu
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
The project is created as an external image classification module to complement a Raspberry Pi camera system, serving as a fall detection system to identify and alert when my grandma falls. This project is also inspired by my research project on lightweight transformers at the Oregon State University System Technology and Application Research Lab under the supervision of Dr. Lizhong Chen, and is related to my class project on Vision Transformers for AI535 Deep Learning taught by Dr. Stephen Lee.