====== Python project: Face tracking in video ======

{{tag>dev python moviepy}}

Lately, I've been trying to improve a bit on the quality of the videos I'm adding to my YouTube channel. As part of this process I'm now using Filmora to edit the video rushes, and adding a circular mask and border on my webcam view. Yet, this process is very tedious and time consuming. So the idea here is to try to automate it using tools such as moviepy and opencv. I'm not sure yet whether the results will be good enough, but I should try it anyway.

Youtube video for this article available at:

;#;
{{ youtube>12gr_XiIqqc?large }}
;#;

===== Reference code =====

Asking our dear friend ChatGPT, I first got the following code suggestion:

<code python>
import cv2
from moviepy.editor import VideoFileClip


# Function to detect and track faces using OpenCV
def detect_faces(frame):
    face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))

    if len(faces) > 0:
        # Return only the first detected face
        return faces[0]
    else:
        return None


# Function to process each frame of the video
def process_frame(frame):
    face_coordinates = detect_faces(frame)

    if face_coordinates is not None:
        x, y, w, h = face_coordinates
        center_x, center_y = x + w // 2, y + h // 2

        # Define the region of interest (ROI) around the detected face
        roi_start_x = max(center_x - w // 2, 0)
        roi_start_y = max(center_y - h // 2, 0)
        roi_end_x = min(center_x + w // 2, frame.shape[1])
        roi_end_y = min(center_y + h // 2, frame.shape[0])

        # Crop and resize the video around the detected face
        cropped_frame = frame[roi_start_y:roi_end_y, roi_start_x:roi_end_x]
        resized_frame = cv2.resize(cropped_frame, (frame.shape[1], frame.shape[0]))

        return resized_frame
    else:
        return frame


# Load the video clip
video_path = 'path/to/your/video.mp4'
video_clip = VideoFileClip(video_path)

# Process each frame of the video
processed_clip = video_clip.fl_image(process_frame)

# Save the processed video
output_path = 'path/to/output/video_processed.mp4'
processed_clip.write_videofile(output_path, audio=True)
</code>

=> This implies installing moviepy and opencv:

<code bash>
pip install moviepy opencv-python
</code>

I have not tried that code yet, but I'm already wondering whether the "haarcascade_frontalface_default.xml" cascade would be good enough, for instance when I'm turning my head from time to time.
So I also asked this question to ChatGPT, and this time got a second version using a deep-learning network instead:

<code bash>
pip install moviepy opencv-python mtcnn
</code>

<code python>
import cv2
from moviepy.editor import VideoFileClip
from mtcnn import MTCNN


# Function to detect and track faces using MTCNN
def detect_faces(frame):
    detector = MTCNN()
    faces = detector.detect_faces(frame)

    if faces:
        # Return only the first detected face
        return faces[0]['box']
    else:
        return None


# Function to process each frame of the video
def process_frame(frame):
    face_coordinates = detect_faces(frame)

    if face_coordinates is not None:
        x, y, w, h = face_coordinates
        center_x, center_y = x + w // 2, y + h // 2

        # Define the region of interest (ROI) around the detected face
        roi_start_x = max(center_x - w // 2, 0)
        roi_start_y = max(center_y - h // 2, 0)
        roi_end_x = min(center_x + w // 2, frame.shape[1])
        roi_end_y = min(center_y + h // 2, frame.shape[0])

        # Crop and resize the video around the detected face
        cropped_frame = frame[roi_start_y:roi_end_y, roi_start_x:roi_end_x]
        resized_frame = cv2.resize(cropped_frame, (frame.shape[1], frame.shape[0]))

        return resized_frame
    else:
        return frame


# Load the video clip
video_path = 'path/to/your/video.mp4'
video_clip = VideoFileClip(video_path)

# Process each frame of the video
processed_clip = video_clip.fl_image(process_frame)

# Save the processed video
output_path = 'path/to/output/video_processed.mp4'
processed_clip.write_videofile(output_path, audio=True)
</code>

===== Preparing initial skeleton component =====

**Note**: I'm adding this new feature to the **NervProj** project, where we already have the **nvp/media/movie_handler.py** component, so I will simply extend that component with an additional command line.

I just added the method **process_webcam_view** in that handler:

<code python>
    def process_webcam_view(self, input_file):
        """Method called to process a webcam view in a given video file"""
        logger.info("Should center face in file %s", input_file)
        return True
</code>

And then created the script **handle-camview** to call the command line **process-webcam-view**:

<code yaml>
  handle-camview:
    notify: false
    custom_python_env: media_env
    cmd: ${PYTHON} ${PROJECT_ROOT_DIR}/nvp/media/movie_handler.py process-webcam-view
    python_path: ["${PROJECT_ROOT_DIR}", "${NVP_ROOT_DIR}"]
</code>

This script can then be executed as usual with the command line:

<code bash>
nvp handle-camview -i my_input_file.mp4
</code>

<code>
2024/01/02 02:19:27 [nvp.nvp_compiler] INFO: MSVC root dir is: D:\Softs\VisualStudio\VS2022
2024/01/02 02:19:27 [nvp.nvp_compiler] INFO: Found msvc-14.34.31933
2024/01/02 02:19:27 [nvp.core.build_manager] INFO: Selecting compiler msvc-14.34.31933 (in D:\Softs\VisualStudio\VS2022)
2024/01/02 02:19:28 [__main__] INFO: Should center face in file my_input_file.mp4
</code>

===== Implementing the deep-learning solution =====

First I need to add the **mtcnn** package to the **media_env** python environment:

<code yaml>
  media_env:
    inherit: default_env
    packages:
      - moviepy
      - Pillow
      - ffmpeg-python
      - opencv-python
      - rembg[gpu]
      - scipy
      - drawsvg[all]
      - hachoir
      - pyPDF2
      - mtcnn
    additional_modules:
      cairo.dll: http://files.nervtech.org/nvp_packages/modules/cairo.dll
</code>

Then we update the installation with the command:

<code bash>
nvp pyenv setup media_env
</code>

And next we write the first version of the process_webcam_view() method:

<code python>
    def detect_faces(self, frame):
        """Helper method to detect a face"""
        detector = MTCNN()
        faces = detector.detect_faces(frame)

        if faces:
            # Return only the first detected face
            return faces[0]["box"]
        else:
            return None

    def process_frame(self, frame):
        """Function to process each frame of the video"""
        face_coordinates = self.detect_faces(frame)

        if face_coordinates is not None:
            x, y, w, h = face_coordinates
            center_x, center_y = x + w // 2, y + h // 2

            # Define the region of interest (ROI) around the detected face
            roi_start_x = max(center_x - w // 2, 0)
            roi_start_y = max(center_y - h // 2, 0)
            roi_end_x = min(center_x + w // 2, frame.shape[1])
            roi_end_y = min(center_y + h // 2, frame.shape[0])

            # Crop and resize the video around the detected face
            cropped_frame = frame[roi_start_y:roi_end_y, roi_start_x:roi_end_x]
            resized_frame = cv2.resize(cropped_frame, (frame.shape[1], frame.shape[0]))

            return resized_frame
        else:
            return frame

    def process_webcam_view(self, input_file):
        """Method called to process a webcam view in a given video file"""
        logger.info("Should center face in file %s", input_file)

        video_clip = VideoFileClip(input_file)

        # Process each frame of the video
        processed_clip = video_clip.fl_image(self.process_frame)

        # Save the processed video
        output_path = self.set_path_extension(input_file, "_centered.mp4")
        processed_clip.write_videofile(output_path, audio=True)

        logger.info("Processing done.")
        return True
</code>

But the first try then failed due to the missing tensorflow package:

<code>
nvp handle-camview -i webcam_test.mkv
2024/01/02 02:40:35 [nvp.nvp_compiler] INFO: MSVC root dir is: D:\Softs\VisualStudio\VS2022
2024/01/02 02:40:35 [nvp.nvp_compiler] INFO: Found msvc-14.34.31933
2024/01/02 02:40:35 [nvp.core.build_manager] INFO: Selecting compiler msvc-14.34.31933 (in D:\Softs\VisualStudio\VS2022)
Traceback (most recent call last):
  File "D:\Projects\NervProj\nvp\media\movie_handler.py", line 14, in <module>
    from mtcnn import MTCNN
  File "D:\Projects\NervProj\.pyenvs\media_env\lib\site-packages\mtcnn\__init__.py", line 26, in <module>
    from mtcnn.mtcnn import MTCNN
  File "D:\Projects\NervProj\.pyenvs\media_env\lib\site-packages\mtcnn\mtcnn.py", line 37, in <module>
    from mtcnn.network.factory import NetworkFactory
  File "D:\Projects\NervProj\.pyenvs\media_env\lib\site-packages\mtcnn\network\factory.py", line 26, in <module>
    from tensorflow.keras.layers import Input, Dense, Conv2D, MaxPooling2D, PReLU, Flatten, Softmax
ModuleNotFoundError: No module named 'tensorflow'
2024/01/02 02:40:36 [nvp.nvp_object] ERROR: Subprocess terminated with error code 1 (cmd=['D:\\Projects\\NervProj\\.pyenvs\\media_env\\python.exe', 'D:\\Projects\\NervProj/nvp/media/movie_handler.py', 'process-webcam-view', '-i', 'webcam_test.mkv'])
2024/01/02 02:40:36 [nvp.components.runner] ERROR: Error occured in script command:
cmd=['D:\\Projects\\NervProj\\.pyenvs\\media_env\\python.exe', 'D:\\Projects\\NervProj/nvp/media/movie_handler.py', 'process-webcam-view', '-i', 'webcam_test.mkv']
cwd=Z:\perso\youtube\videos\projects\04_atm_transmittance_unit_test_part2\rushes
return code=1
lastest outputs:
Traceback (most recent call last):
  File "D:\Projects\NervProj\nvp\media\movie_handler.py", line 14, in <module>
    from mtcnn import MTCNN
  File "D:\Projects\NervProj\.pyenvs\media_env\lib\site-packages\mtcnn\__init__.py", line 26, in <module>
    from mtcnn.mtcnn import MTCNN
  File "D:\Projects\NervProj\.pyenvs\media_env\lib\site-packages\mtcnn\mtcnn.py", line 37, in <module>
    from mtcnn.network.factory import NetworkFactory
  File "D:\Projects\NervProj\.pyenvs\media_env\lib\site-packages\mtcnn\network\factory.py", line 26, in <module>
    from tensorflow.keras.layers import Input, Dense, Conv2D, MaxPooling2D, PReLU, Flatten, Softmax
ModuleNotFoundError: No module named 'tensorflow'
</code>

So I tried to also install the **tensorflow[and-cuda]** package in the media_env environment:
<code yaml>
  media_env:
    inherit: default_env
    packages:
      - moviepy
      - Pillow
      - ffmpeg-python
      - opencv-python
      - rembg[gpu]
      - scipy
      - drawsvg[all]
      - hachoir
      - pyPDF2
      - mtcnn
      - tensorflow[and-cuda]
    additional_modules:
      cairo.dll: http://files.nervtech.org/nvp_packages/modules/cairo.dll
</code>

For indications on how to install tensorflow I used this page as reference: https://www.tensorflow.org/install/pip?hl=en

=> The installation will take some time due to the download of the CUDA libraries:

<code>
Collecting tensorflow[and-cuda] (from -r D:\Projects\NervProj\.pyenvs\media_env\requirements.txt (line 15))
  Obtaining dependency information for tensorflow[and-cuda] from https://files.pythonhosted.org/packages/df/84/0a67b7ad368b597fa4fc60e2ae2f0fbe9c527c6fe5dbf290236a459fe4a6/tensorflow-2.14.1-cp310-cp310-win_amd64.whl.metadata
  Downloading tensorflow-2.14.1-cp310-cp310-win_amd64.whl.metadata (3.3 kB)
Collecting tensorflow-intel==2.14.1 (from tensorflow[and-cuda]->-r D:\Projects\NervProj\.pyenvs\media_env\requirements.txt (line 15))
  Obtaining dependency information for tensorflow-intel==2.14.1 from https://files.pythonhosted.org/packages/00/6a/de5fcbab0e4e68142a2e5499f672b05c38e65ccc20ad4508f05bfaac78ea/tensorflow_intel-2.14.1-cp310-cp310-win_amd64.whl.metadata
  Downloading tensorflow_intel-2.14.1-cp310-cp310-win_amd64.whl.metadata (4.8 kB)
Collecting nvidia-cublas-cu11==11.11.3.6 (from tensorflow[and-cuda]->-r D:\Projects\NervProj\.pyenvs\media_env\requirements.txt (line 15))
  Downloading nvidia_cublas_cu11-11.11.3.6-py3-none-win_amd64.whl (427.2 MB)
</code>

After that update, the command ''nvp handle-camview -i webcam_test.mkv'' works, but it takes ages for each frame 🤕:

<code>
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 123ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 87ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 33ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 23ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 20ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 17ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 16ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 14ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 13ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 13ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 14ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 13ms/step
1/5 [=====>........................] - ETA: 0s
5/5 [==============================] - 0s 4ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 78ms/step
t: 0%| | 49/39254 [00:52<12:24:14, 1.14s/it, now=None]
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 119ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 84ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 33ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 23ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 19ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 17ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 15ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 14ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 14ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 12ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 13ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 13ms/step
1/5 [=====>........................] - ETA: 0s
5/5 [==============================] - 0s 4ms/step
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 78ms/step
t: 0%| | 50/39254 [00:53<12:20:06, 1.13s/it, now=None]
</code>

**Note**: I'm now using a dedicated folder to store this project's temp files: **Z:\dev\data\the_lone_engineer\001_face_centering**

=> ref: **webcam_test_centered_v0.mp4**

===== Investigating how to improve the performance =====

The implementation above is way too slow, so we need to find a way to improve the performance. Here is one suggestion using multiple threads:

<code python>
import cv2
from moviepy.editor import VideoFileClip
from mtcnn.mtcnn import MTCNN
import concurrent.futures


def detect_faces(frame):
    detector = MTCNN()
    faces = detector.detect_faces(frame)
    if faces:
        return faces[0]['box']
    else:
        return None


def process_frame(frame):
    face_coordinates = detect_faces(frame)

    if face_coordinates is not None:
        x, y, w, h = face_coordinates
        center_x, center_y = x + w // 2, y + h // 2

        roi_start_x = max(center_x - w // 2, 0)
        roi_start_y = max(center_y - h // 2, 0)
        roi_end_x = min(center_x + w // 2, frame.shape[1])
        roi_end_y = min(center_y + h // 2, frame.shape[0])

        cropped_frame = frame[roi_start_y:roi_end_y, roi_start_x:roi_end_x]
        resized_frame = cv2.resize(cropped_frame, (frame.shape[1], frame.shape[0]))

        return resized_frame
    else:
        return frame


def process_video(video_path, output_path):
    video_clip = VideoFileClip(video_path)
    frames = [frame for frame in video_clip.iter_frames()]

    with concurrent.futures.ThreadPoolExecutor() as executor:
        processed_frames = list(executor.map(process_frame, frames))

    processed_clip = VideoFileClip(video_path).fl(processed_frames)
    processed_clip.write_videofile(output_path, audio=True)
</code>

Unfortunately this option doesn't really work properly, as it requires extracting **all** the frames from the video clip as numpy arrays: on my side this made the python process use more than 20GB of memory and froze my system (a rough estimate below shows why this cannot fit in memory).
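Just to put a rough number on it, here is a quick back-of-the-envelope estimate, assuming a 1080p source (the exact resolution of my rushes may differ) and the ~39254 frames reported by the moviepy progress bar above:

<code python>
# Rough estimate of the memory needed to keep every decoded RGB frame in memory
# (the 1080p resolution is an assumption, the frame count comes from the log above)
frame_bytes = 1920 * 1080 * 3              # one uint8 RGB frame: ~6.2 MB
nframes = 39254
total_gb = frame_bytes * nframes / 1024**3
print(f"~{total_gb:.0f} GB")               # => ~227 GB: far more RAM than I have
</code>

So clearly we cannot simply buffer the whole clip.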
=> I really think I need to process the frames one by one to extract the face position, but the key is maybe to not process every single frame: instead I could process 1 frame every 60 frames, or even every 120 frames (e.g. every second or every 2 seconds).

Here is the updated function I wrote to compute the face position only once every 60 frames:

<code python>
    def process_frame(self, frame):
        """Function to process each frame of the video"""
        logger.info("Processing frame %d", self.frame_index)

        # Check if we should detect the face or not:
        if self.frame_index % 60 == 0:
            face_coordinates = self.detect_faces(frame)

            if face_coordinates is not None:
                x, y, w, h = face_coordinates
                center_x, center_y = x + w // 2, y + h // 2
                self.target_face_cx = center_x
                self.target_face_cy = center_y

                # Init the face coords if needed:
                if self.face_cx is None:
                    self.face_cx = center_x
                if self.face_cy is None:
                    self.face_cy = center_y

        if self.target_face_cx is not None:
            self.face_cx += (self.target_face_cx - self.face_cx) * 0.1
            self.face_cy += (self.target_face_cy - self.face_cy) * 0.1

        # Define the region of interest (ROI) around the detected face
        hsize = (self.frame_size * 3) // 2
        fcx = self.face_cx or frame.shape[1] // 2
        fcy = self.face_cy or frame.shape[0] // 2

        cx = min(max(fcx, hsize), frame.shape[1] - hsize)
        cy = min(max(fcy, hsize), frame.shape[0] - hsize)

        roi_start_x = int(cx - hsize)
        roi_start_y = int(cy - hsize)
        roi_end_x = int(cx + hsize)
        roi_end_y = int(cy + hsize)

        # Crop and resize the video around the detected face
        cropped_frame = frame[roi_start_y:roi_end_y, roi_start_x:roi_end_x]
        resized_frame = cv2.resize(cropped_frame, (self.frame_size, self.frame_size))

        self.frame_index += 1
        return resized_frame
</code>

For this first working version I used a target frame size of 256px with a source area of 3x256 pixels: I should make this configurable instead.

=> ref: **webcam_test_centered_v1.mp4**

Also, for the interpolation, I should rather use a windowed mean class.

===== Using WindowedMean for interpolation of the face position =====

Here is the WindowedMean class implementation:

<code python>
"""WindowedMean class definition"""

from collections import deque


class WindowedMean:
    """
    Class for efficiently computing a windowed mean on the last N values
    from a measurement.

    Attributes:
        window_size (int): The size of the window for computing the mean.
        values (deque): A deque to store the last N measurements.
        sum (float): The sum of the values in the window.

    Methods:
        add_value(value): Adds a new value to the window and updates the mean.
        get_mean(): Calculates and returns the mean of the values in the window.
    """

    def __init__(self, window_size):
        """
        Initialize the WindowedMean object.

        Args:
            window_size (int): The size of the window for computing the mean.
        """
        self.window_size = window_size
        self.values = deque(maxlen=window_size)
        self.sum = 0.0

    def add_value(self, value):
        """
        Add a new value to the window and update the mean.

        Args:
            value: The new measurement value.
        """
        if len(self.values) == self.window_size:
            # Subtract the oldest value from the sum
            self.sum -= self.values[0]

        # Add the new value to the deque and the sum
        self.values.append(value)
        self.sum += value

    def get_mean(self):
        """
        Calculate and return the mean of the values in the window.

        Returns:
            float: The computed mean.
        """
        if not self.values:
            return None  # Return None if no values are available

        return self.sum / len(self.values)
</code>

And now using that for the face cx/cy interpolation:

<code python>
        if self.target_face_cx is not None:
            self.face_cx.add_value(self.target_face_cx)
            self.face_cy.add_value(self.target_face_cy)

        # Define the region of interest (ROI) around the detected face
        hsize = (self.frame_size * 3) // 2
        fcx = self.face_cx.get_mean() or frame.shape[1] // 2
        fcy = self.face_cy.get_mean() or frame.shape[0] // 2
</code>

=> ref: **webcam_test_centered_v2.mp4**

===== Testing with facenet_pytorch package =====

I installed the required packages in my media_env environment.

=> See this page for usage of the MTCNN component of the **facenet_pytorch** package: https://www.kaggle.com/code/timesler/guide-to-mtcnn-in-facenet-pytorch/notebook

I updated the detect_faces method:

<code python>
    def detect_faces(self, frame):
        """Helper method to detect a face"""
        if self.face_detector is None:
            logger.info("Initializing MTCNN detector...")
            self.face_detector = MTCNN(
                device="cuda",
                select_largest=False,
                post_process=False,
            )
            # self.detect_face_func = capture_output(self.face_detector.detect_faces)
            # self.detect_face_func = capture_output(self.face_detector.detect)
            self.detect_face_func = self.face_detector.detect

        boxes, _ = self.detect_face_func(frame)
        # faces = self.detect_face_func(frame)

        # if faces:
        #     # Return only the first detected face
        #     return faces[0]["box"]
        if boxes is not None:
            return boxes[0]
        else:
            return None
</code>

And also the face coordinates computation:

<code python>
        if face_coordinates is not None:
            # x, y, w, h = face_coordinates
            left, top, right, bottom = face_coordinates
            # center_x, center_y = x + w // 2, y + h // 2
            center_x, center_y = (left + right) / 2.0, (top + bottom) / 2.0
</code>

=> ref: **webcam_test_centered_v3.mp4**

This implementation works just as well as the previous version and is significantly faster (about 106 it/s vs. 60 it/s).

===== Removing tracking delay =====

The next point I want to cover is to cancel the current interpolation delay, which comes from the fact that we take 60 frames to reach the correct center position. To achieve that, we will need to perform the face detection/extraction in 2 passes, roughly as sketched just below.
**OK**, so we now preprocess the frames to extract the face positions, and then use the **interp1d** function from scipy to perform the interpolation:

<code python>
    def process_frame(self, frame):
        """Function to process each frame of the video"""
        if self.frame_index % self.face_window_len == 0:
            logger.info("Processing frame %d", self.frame_index)

        # Perform interpolation to get the face x/y position:
        fcx = self.interp_fcx_func(self.frame_index)
        fcy = self.interp_fcy_func(self.frame_index)

        # Define the region of interest (ROI) around the detected face
        hsize = (self.frame_size * 3) // 2

        cx = min(max(fcx, hsize), frame.shape[1] - hsize)
        cy = min(max(fcy, hsize), frame.shape[0] - hsize)

        roi_start_x = int(cx - hsize)
        roi_start_y = int(cy - hsize)
        roi_end_x = int(cx + hsize)
        roi_end_y = int(cy + hsize)

        # Crop and resize the video around the detected face
        cropped_frame = frame[roi_start_y:roi_end_y, roi_start_x:roi_end_x]
        resized_frame = cv2.resize(cropped_frame, (self.frame_size, self.frame_size))

        self.frame_index += 1
        return resized_frame

    def collect_face_position(self, frame, nframes):
        """Collect the face position for a given frame"""
        if self.frame_index % self.face_window_len == 0:
            logger.info("Collecting face at frame %d/%d", self.frame_index, nframes)
            face_coordinates = self.detect_faces(frame)

            if face_coordinates is not None:
                # x, y, w, h = face_coordinates
                left, top, right, bottom = face_coordinates
                # center_x, center_y = x + w // 2, y + h // 2
                center_x, center_y = (left + right) / 2.0, (top + bottom) / 2.0

                self.frame_indices.append(self.frame_index)
                self.face_pos_x.append(center_x)
                self.face_pos_y.append(center_y)

        self.frame_index += 1
        return frame
</code>

=> ref: **webcam_test2_centered_v4_180f.mp4**, **webcam_test2_centered_v4_60f.mp4** and **webcam_test2_centered_v4_60f_linear.mp4**

**Note**: I tested both "cubic" and "linear" interpolations and the linear version seems a bit more appropriate to me 🤔.

===== Dynamic face scaling =====

The next point I would like to cover is the face scale control: currently we use a fixed scale of, for instance, 3 times the destination clip size. But I think it could be interesting to use a dynamic scaling process based on the size of the face in the source image, and then dynamically change the size of the source area depending on that face size.

I added smooth interpolation for the source frame size:

<code python>
        # Perform interpolation to get the face x/y position:
        cx = self.interp_fcx_func(self.frame_index)
        cy = self.interp_fcy_func(self.frame_index)
        fsize = self.interp_fsize_func(self.frame_index)
        # logger.info("source frame size: %f", fsize)

        # Define the region of interest (ROI) around the detected face
        # hsize = (self.frame_size * 3) // 2
        self.current_fsize += (fsize - self.current_fsize) * 0.003
        hsize = self.current_fsize * 3.0 / 2.0
        hsize = min(hsize, cx, cy, frame.shape[1] - cx, frame.shape[0] - cy)

        # cx = min(max(fcx, hsize), frame.shape[1] - hsize)
        # cy = min(max(fcy, hsize), frame.shape[0] - hsize)
</code>

**Note**: I also tweaked the frame_window_size, the scaling adaptation factor, the source frame size factor, etc. quite a lot, and I now seem to have some appropriate settings ;-). The face size sampling side of this is sketched below.

=> ref: **webcam_test2_centered_v5.mp4**
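The snippet above uses an **interp_fsize_func** which I don't show being built. Roughly, the face size samples could be collected alongside the positions like this (a sketch only: using the face height as the size measure is an assumption on my part, and the actual handler code may differ):

<code python>
    def collect_face_position(self, frame, nframes):
        """Collect the face position and size for a given frame"""
        if self.frame_index % self.face_window_len == 0:
            face_coordinates = self.detect_faces(frame)

            if face_coordinates is not None:
                left, top, right, bottom = face_coordinates
                self.frame_indices.append(self.frame_index)
                self.face_pos_x.append((left + right) / 2.0)
                self.face_pos_y.append((top + bottom) / 2.0)
                # Also keep track of the face size for the dynamic scaling
                self.face_sizes.append(bottom - top)

        self.frame_index += 1
        return frame
</code>

With self.face_sizes initialized alongside the other lists, **interp_fsize_func** would then be built exactly like the x/y interpolation functions, using interp1d over these samples.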
===== Writing to webm files with transparency? =====

To write webm files we simply change the output file extension:

<code python>
        output_path = self.set_path_extension(input_file, "_centered.webm")
</code>

And we change the codec:

<code python>
        # Save the processed video
        # vcodec = "libvpx-vp9"
        vcodec = "libvpx"
        processed_clip.write_videofile(
            output_path, audio=True, codec=vcodec, audio_codec="libvorbis", bitrate="700k", threads=4
        )
</code>

**Note**: libvpx-vp9 seems significantly slower than libvpx, so we stick to the latter for now.

Arrgghh... to support transparency, we need to specify the pixel format for ffmpeg:

<code python>
        processed_clip.write_videofile(
            output_path,
            audio=True,
            codec=vcodec,
            audio_codec="libvorbis",
            bitrate="700k",
            threads=4,
            ffmpeg_params=["-pix_fmt", "yuva420p"],
        )
</code>

But unfortunately, this then generates an error while writing the output with the libvpx codec:

<code>
MoviePy error: FFMPEG encountered the following error while writing file webcam_test2_centered.webm:

b'[libvpx @ 000001e590991c00] Transparency encoding with auto_alt_ref does not work\r\nError initializing output stream 0:0 -- Error while opening encoder for output stream #0:0 - maybe incorrect parameters such as bit_rate, rate, width or height\r\n'
</code>

=> To fix this issue it seems we really need to use the **libvpx-vp9** codec.

OK, this doesn't work so well at playback either, unfortunately. So let's try another codec (cf. https://stackoverflow.com/questions/75455066/convert-transparent-pngs-to-a-transparent-video).

**Note**: I was generating the circular mask as follows:

<code python>
    # Function to create a circular mask
    def make_circular_mask(self, size, center, radius):
        """Method to generate a circular mask"""
        mask = Image.new("L", size, 0)
        draw = ImageDraw.Draw(mask)
        draw.ellipse(
            (center[0] - radius, center[1] - radius, center[0] + radius, center[1] + radius), fill=1
        )
        return np.array(mask).astype(np.uint8)
</code>

I tested the **prores** / **prores_ks** codecs with the "yuva444p10le" pixel format: **not working at playback** either.

I also checked that page: https://curiosalon.github.io/blog/ffmpeg-alpha-masking/

=> In the end, nothing seems to work as expected here and I cannot generate a video with transparency at playback 🤕.
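For reference, inside the handler such a circular mask would be attached to the processed clip roughly like this before writing the file (a sketch based on moviepy's ImageClip / set_mask API; the exact wiring in my handler may differ a bit):

<code python>
from moviepy.editor import ImageClip

# Build a mask clip (values in [0, 1]) from the circular mask array and attach it
# to the clip before writing it out; size/center/radius values are placeholders here.
size = (self.frame_size, self.frame_size)
center = (self.frame_size // 2, self.frame_size // 2)
mask_array = self.make_circular_mask(size, center, self.frame_size // 2)

mask_clip = ImageClip(mask_array, ismask=True, duration=processed_clip.duration)
processed_clip = processed_clip.set_mask(mask_clip)
</code>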