Video Action Understanding
Video action understanding uses algorithms to analyze the relationships between moving subjects in a video: between people, between people and objects, and between people and the environment. It aims to understand the actions and behaviors of the people in the video and, by exploiting the video's temporal information, to detect events. We mainly focus on four sub-tasks: atomic action segmentation, video action recognition, temporal action detection, and spatio-temporal action detection.

1. Video action recognition is a basic task in video action understanding that can serve the downstream temporal action detection and spatio-temporal action detection tasks. Its goal is to identify the action category in a trimmed video and to give a confidence score for that category. Its core is video feature representation, which usually uses optical flow frames and RGB frames to extract motion and appearance information from videos. We are committed to extracting video features more efficiently and effectively. We seek to decouple video features along the spatial and temporal dimensions to improve their interpretability. At the same time, to address the huge number of parameters in 3D convolutions, we decompose the structure of the convolution kernel to reduce the model's parameter count and improve its computational efficiency.

2. Temporal action detection is to locate the boundaries of actions in untrimmed videos and to give the action category of each detected segment. Temporal action detection has important value in video editing, video recommendation, and intelligent surveillance. We strive to build more accurate descriptions of action boundaries and to generate high-quality action proposal segments.
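The parameter savings from decomposing a 3D convolution kernel into a spatial part and a temporal part, as mentioned above, can be illustrated with a quick parameter count. This is a minimal sketch under illustrative assumptions: the channel sizes and the choice of intermediate width are examples, not our actual network configuration.

```python
def conv3d_params(c_in, c_out, t, k):
    """Parameters of a full 3D convolution with a t x k x k kernel (bias ignored)."""
    return c_in * c_out * t * k * k

def conv2plus1d_params(c_in, c_out, t, k, c_mid):
    """Parameters after factorizing into a 1 x k x k spatial convolution followed
    by a t x 1 x 1 temporal convolution, with c_mid intermediate channels
    (bias ignored). c_mid is a free design choice; here it is illustrative."""
    spatial = c_in * c_mid * k * k   # 1 x k x k spatial kernel
    temporal = c_mid * c_out * t     # t x 1 x 1 temporal kernel
    return spatial + temporal

# Example: a typical 3 x 3 x 3 kernel with 256 input and output channels.
full = conv3d_params(256, 256, 3, 3)                # 1,769,472 parameters
factored = conv2plus1d_params(256, 256, 3, 3, 256)  # 786,432 parameters
```

With these example numbers the factorized kernel uses well under half the parameters of the full 3D kernel, which is the motivation for the kernel decomposition described above.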
To produce more accurate action segments, we propose a more flexible temporal action detection framework. In addition, to handle the ambiguity of action boundaries, boundary features are learned within the video feature representation to highlight the difference between action and background in the video.

3. Spatio-temporal action detection is an important research direction in the field of video behavior understanding. Its goal is to localize and identify actions of interest in videos simultaneously in time and space, which is of great significance for understanding how actions in videos evolve over space and time. Compared with merely detecting the start and end times of an action, also detecting the spatial position of the action subject in each frame opens broader prospects for practical application. Building on existing frame-level and snippet-level spatio-temporal action detection methods, we focus on improving the model's multi-task feature learning ability and the effectiveness of its action modeling. At the same time, we also explore semi-supervised spatio-temporal action detection in data-scarce scenarios.

4. Atomic action segmentation is an important step toward understanding the temporal relationships among complex actions. Its goal is to divide a complete action in time into several atomic actions without supervision. These atomic actions constitute a compact temporal representation of the original action and a complete description of its semantics. Compared with the coarse division of actions under an event in temporal action segmentation, unsupervised atomic actions give a more fine-grained division of the original action, which will improve action modeling in action recognition and action detection.
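As a toy illustration of dividing an action into atomic segments without supervision, the sketch below splits a sequence of per-frame feature vectors wherever consecutive frames change abruptly. The Euclidean-distance threshold heuristic is purely illustrative; actual unsupervised methods learn both the features and the temporal model.

```python
def atomic_segments(features, threshold):
    """Split a sequence of per-frame feature vectors into contiguous segments,
    starting a new segment whenever consecutive frames differ by more than
    `threshold` in Euclidean distance. Returns (start, end) pairs, end exclusive.
    An illustrative change-point heuristic, not a learned segmentation model."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    boundaries = [0]
    for i in range(1, len(features)):
        if dist(features[i - 1], features[i]) > threshold:
            boundaries.append(i)
    boundaries.append(len(features))
    return list(zip(boundaries[:-1], boundaries[1:]))

# Two steady "poses" with one abrupt change between frames 2 and 3.
frames = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(atomic_segments(frames, threshold=1.0))  # [(0, 3), (3, 5)]
```

The resulting (start, end) pairs are the kind of compact temporal decomposition described above, which downstream recognition and detection tasks can then consume.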
Based on deep learning, we focus on improving the feature representation ability in unsupervised atomic action segmentation and the ability to model the temporal relationships between atomic actions. In addition, we also focus on combining atomic action segmentation with subsequent video understanding tasks.

Selected Papers

- Zhao, Zixuan; Liu, Shuming; Zhao, Chengze; Zhao, Xu. Constructing Semantical Structure by Segmentation Integrated Video Embedding for Temporal Action Detection. IEEE TCSVT 2025.
- Zhao, Zixuan; Wang, Dongqi; Zhao, Xu. M3Net: Movement Enhancement with Multi-Relation toward Multi-Scale Video Representation for Temporal Action Detection. Pattern Recognition 2024.
- Zhao, Zixuan; Wang, Dongqi; Zhao, Xu. Movement Enhancement toward Multi-Scale Video Feature Representation for Temporal Action Detection. ICCV 2023.
- Su, Haisheng; Zhao, Xu; Lin, Tianwei; Liu, Shuming; Hu, Zhilan. Transferable Knowledge-Based Multi-Granularity Fusion Network for Weakly Supervised Temporal Action Detection. IEEE Transactions on Multimedia 2021.
- Lin, Tianwei; Zhao, Xu; Su, Haisheng; Wang, Chongjing; Yang, Ming. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. ECCV 2018.
- Lin, Tianwei; Zhao, Xu; Shou, Zheng. Single Shot Temporal Action Detection. ACM MM 2017.