X-AnyLabeling is a powerful annotation tool integrated with an AI engine for fast, automatic labeling. Designed for multi-modal data engineers, it offers industrial-grade solutions for complex tasks. It supports images and videos, GPU acceleration, custom models, one-click inference over all images in a task, and import/export in formats such as COCO, VOC, and YOLO. It handles classification, detection, segmentation, captioning, rotated-object detection, tracking, pose estimation, OCR, VQA, grounding, and more, with annotation styles that include polygons, rectangles, and rotated boxes.
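As a rough illustration of how two of those export formats differ (this is a generic sketch, not X-AnyLabeling's API): YOLO labels store a class id plus normalized center/size, while VOC stores absolute pixel corners.

```python
def yolo_to_voc(box, img_w, img_h):
    """Convert one YOLO box (class_id, x_center, y_center, w, h, all in [0, 1])
    into VOC-style absolute pixel corners (class_id, xmin, ymin, xmax, ymax)."""
    cls_id, xc, yc, w, h = box
    xmin = (xc - w / 2) * img_w
    ymin = (yc - h / 2) * img_h
    xmax = (xc + w / 2) * img_w
    ymax = (yc + h / 2) * img_h
    return cls_id, int(round(xmin)), int(round(ymin)), int(round(xmax)), int(round(ymax))

# Example: a box covering the central quarter of a 640x480 image
print(yolo_to_voc((0, 0.5, 0.5, 0.5, 0.5), 640, 480))  # (0, 160, 120, 480, 360)
```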
MiniMind-V is an open-source tiny visual-language model (VLM) project that demonstrates how to train a 26M-parameter multimodal VLM from scratch quickly and cheaply (roughly one hour on a single RTX 3090, at very low GPU-rental cost). The repo provides end-to-end code for data cleaning, pretraining, supervised fine-tuning (SFT), evaluation, and a demo, using CLIP as the visual encoder and MiniMind as the base LLM.
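A minimal PyTorch sketch of the core recipe, with a stand-in transformer in place of the real CLIP tower and MiniMind decoder (all dimensions, names, and the smoke-test shapes below are illustrative assumptions, not the repo's code): project CLIP patch features into the LLM's embedding space and prepend them to the text token embeddings.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Sketch of the MiniMind-V idea: frozen vision features -> linear projection
    -> concatenated with text embeddings -> language-model head."""
    def __init__(self, vision_dim=768, llm_dim=512, vocab_size=6400):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)        # vision space -> LLM space
        self.tok_emb = nn.Embedding(vocab_size, llm_dim)  # stand-in token embeddings
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the LLM
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, patch_feats, input_ids):
        # patch_feats: (B, num_patches, vision_dim) from a frozen CLIP vision encoder
        vis_tokens = self.proj(patch_feats)               # (B, P, llm_dim)
        txt_tokens = self.tok_emb(input_ids)              # (B, T, llm_dim)
        x = torch.cat([vis_tokens, txt_tokens], dim=1)    # visual tokens come first
        return self.lm_head(self.backbone(x))             # per-position vocabulary logits

# Smoke test with CLIP ViT-B/16-like shapes: 196 patches of dimension 768
model = TinyVLM()
logits = model(torch.randn(2, 196, 768), torch.randint(0, 6400, (2, 16)))
print(logits.shape)  # torch.Size([2, 212, 6400])
```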
Stanford’s 10-week CS231n goes from first principles to state-of-the-art vision research, starting with image-classification basics, loss functions, and optimization, then building from fully-connected nets to modern CNNs, residual networks, and vision-transformer architectures. Lectures span training tricks, regularization, visualization, transfer learning, detection, segmentation, video, 3D, and generative models. Three hands-on PyTorch assignments guide students from k-NN and SVM classifiers through deep CNNs and network visualization, and a capstone project lets teams train large-scale models on a vision task of their choice, so students graduate with the skills to design, debug, and deploy real-world deep-learning pipelines.
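To give a flavor of the early assignments, here is a vectorized multiclass SVM (hinge) loss of the kind implemented in the first assignment; the exact signature and regularization constant are illustrative, not the official starter code.

```python
import numpy as np

def svm_loss_vectorized(W, X, y, reg=1e-4):
    """Multiclass SVM (hinge) loss.
    W: (D, C) weights, X: (N, D) examples, y: (N,) integer class labels."""
    N = X.shape[0]
    scores = X @ W                                    # (N, C) class scores
    correct = scores[np.arange(N), y][:, None]        # (N, 1) correct-class scores
    margins = np.maximum(0, scores - correct + 1.0)   # hinge margins with delta = 1
    margins[np.arange(N), y] = 0                      # don't count the correct class
    return margins.sum() / N + reg * np.sum(W * W)    # data loss + L2 regularization

# Toy check: 5 examples, 10 classes, 3072-dim flattened CIFAR-style inputs
rng = np.random.default_rng(0)
W = rng.normal(scale=1e-3, size=(3072, 10))
X = rng.normal(size=(5, 3072))
y = rng.integers(0, 10, size=5)
print(svm_loss_vectorized(W, X, y))
```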