Python project: Face tracking in video

Lately, I've been trying to improve the quality of the videos I'm adding to my YouTube channel. As part of this process I'm now using Filmora to edit the video rushes, adding a circular mask and border on my webcam view. Yet this process is very tedious and time-consuming, so my idea here is to try to automate it using tools such as MoviePy and OpenCV. I'm not sure yet the results will be acceptable, but I should try it anyway.

YouTube video for this article available at:

Asking our dear friend ChatGPT, I first got the following code suggestion:

import cv2
from moviepy.editor import VideoFileClip

# Function to detect and track faces using OpenCV
def detect_faces(frame):
    face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))
    
    if len(faces) > 0:
        # Return only the first detected face
        return faces[0]
    else:
        return None

# Function to process each frame of the video
def process_frame(frame):
    face_coordinates = detect_faces(frame)
    
    if face_coordinates is not None:
        x, y, w, h = face_coordinates
        center_x, center_y = x + w // 2, y + h // 2
        
        # Define the region of interest (ROI) around the detected face
        roi_start_x = max(center_x - w // 2, 0)
        roi_start_y = max(center_y - h // 2, 0)
        roi_end_x = min(center_x + w // 2, frame.shape[1])
        roi_end_y = min(center_y + h // 2, frame.shape[0])
        
        # Crop and resize the video around the detected face
        cropped_frame = frame[roi_start_y:roi_end_y, roi_start_x:roi_end_x]
        resized_frame = cv2.resize(cropped_frame, (frame.shape[1], frame.shape[0]))
        
        return resized_frame
    else:
        return frame

# Load the video clip
video_path = 'path/to/your/video.mp4'
video_clip = VideoFileClip(video_path)

# Process each frame of the video
processed_clip = video_clip.fl_image(process_frame)

# Save the processed video
output_path = 'path/to/output/video_processed.mp4'
processed_clip.write_videofile(output_path, audio=True)

⇒ This implies installing moviepy and opencv:

pip install moviepy opencv-python
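
As a quick sanity check before processing a whole video, the Haar detector could be tried on a single frame grabbed with OpenCV (a hypothetical snippet of mine, reusing the detect_faces function above):

import cv2

# Grab the first frame of the video with OpenCV
cap = cv2.VideoCapture('path/to/your/video.mp4')
ok, frame = cap.read()
cap.release()

if ok:
    # Run the Haar-cascade detector from the suggestion above
    box = detect_faces(frame)
    if box is not None:
        x, y, w, h = [int(v) for v in box]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite('first_frame_face.png', frame)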

I have not tried that code yet, but I'm already wondering whether the “haarcascade_frontalface_default.xml” model would be robust enough, for instance when I turn my head from time to time. So I also asked ChatGPT about this, and got a second version, this time using a deep-learning network instead:

pip install moviepy opencv-python mtcnn

import cv2
from moviepy.editor import VideoFileClip
from mtcnn import MTCNN

# Function to detect and track faces using MTCNN
def detect_faces(frame):
    # Note: a new MTCNN detector is created on every call here, which is costly
    detector = MTCNN()
    faces = detector.detect_faces(frame)
    
    if faces:
        # Return only the first detected face
        return faces[0]['box']
    else:
        return None

# Function to process each frame of the video
def process_frame(frame):
    face_coordinates = detect_faces(frame)
    
    if face_coordinates is not None:
        x, y, w, h = face_coordinates
        center_x, center_y = x + w // 2, y + h // 2
        
        # Define the region of interest (ROI) around the detected face
        roi_start_x = max(center_x - w // 2, 0)
        roi_start_y = max(center_y - h // 2, 0)
        roi_end_x = min(center_x + w // 2, frame.shape[1])
        roi_end_y = min(center_y + h // 2, frame.shape[0])
        
        # Crop and resize the video around the detected face
        cropped_frame = frame[roi_start_y:roi_end_y, roi_start_x:roi_end_x]
        resized_frame = cv2.resize(cropped_frame, (frame.shape[1], frame.shape[0]))
        
        return resized_frame
    else:
        return frame

# Load the video clip
video_path = 'path/to/your/video.mp4'
video_clip = VideoFileClip(video_path)

# Process each frame of the video
processed_clip = video_clip.fl_image(process_frame)

# Save the processed video
output_path = 'path/to/output/video_processed.mp4'
processed_clip.write_videofile(output_path, audio=True)

Note: I'm adding this new feature to the NervProj project, where we already have the nvp/media/movie_handler.py component, so I will simply extend that component with an additional command line.

Just added the method process_webcam_view in that handler:

    def process_webcam_view(self, input_file):
        """Method called to process a webcam view in a given video file"""
        logger.info("Should center face in file %s", input_file)

        return True

And then I created the handle-camview script entry to invoke the process-webcam-view command:

  handle-camview:
    notify: false
    custom_python_env: media_env
    cmd: ${PYTHON} ${PROJECT_ROOT_DIR}/nvp/media/movie_handler.py process-webcam-view
    python_path: ["${PROJECT_ROOT_DIR}", "${NVP_ROOT_DIR}"]

This script can then be executed as usual from the command line:

nvp handle-camview -i my_input_file.mp4
2024/01/02 02:19:27 [nvp.nvp_compiler] INFO: MSVC root dir is: D:\Softs\VisualStudio\VS2022
2024/01/02 02:19:27 [nvp.nvp_compiler] INFO: Found msvc-14.34.31933
2024/01/02 02:19:27 [nvp.core.build_manager] INFO: Selecting compiler msvc-14.34.31933 (in D:\Softs\VisualStudio\VS2022)
2024/01/02 02:19:28 [__main__] INFO: Should center face in file my_input_file.mp4

First I need to add the mtcnn package to the media_env Python environment:

  media_env:
    inherit: default_env
    packages:
      - moviepy
      - Pillow
      - ffmpeg-python
      - opencv-python
      - rembg[gpu]
      - scipy
      - drawsvg[all]
      - hachoir
      - pyPDF2
      - mtcnn
    additional_modules:
      cairo.dll: http://files.nervtech.org/nvp_packages/modules/cairo.dll

Then we update the installation with the command:

nvp pyenv setup media_env

And next we write the first version of the process_webcam_view() method:

    def detect_faces(self, frame):
        """Helper method to detect a face"""
        # Note: the detector is re-created on every call here, which is costly
        detector = MTCNN()
        faces = detector.detect_faces(frame)

        if faces:
            # Return only the first detected face
            return faces[0]["box"]
        else:
            return None

    def process_frame(self, frame):
        """Function to process each frame of the video"""
        face_coordinates = self.detect_faces(frame)

        if face_coordinates is not None:
            x, y, w, h = face_coordinates
            center_x, center_y = x + w // 2, y + h // 2

            # Define the region of interest (ROI) around the detected face
            roi_start_x = max(center_x - w // 2, 0)
            roi_start_y = max(center_y - h // 2, 0)
            roi_end_x = min(center_x + w // 2, frame.shape[1])
            roi_end_y = min(center_y + h // 2, frame.shape[0])

            # Crop and resize the video around the detected face
            cropped_frame = frame[roi_start_y:roi_end_y, roi_start_x:roi_end_x]
            resized_frame = cv2.resize(cropped_frame, (frame.shape[1], frame.shape[0]))

            return resized_frame
        else:
            return frame

    def process_webcam_view(self, input_file):
        """Method called to process a webcam view in a given video file"""
        logger.info("Should center face in file %s", input_file)

        video_clip = VideoFileClip(input_file)

        # Process each frame of the video
        processed_clip = video_clip.fl_image(self.process_frame)

        # Save the processed video
        output_path = self.set_path_extension(input_file, "_centered.mp4")
        processed_clip.write_videofile(output_path, audio=True)

        logger.info("Processing done.")
        return True

But the first try then failed due to the missing tensorflow package:

nvp handle-camview -i webcam_test.mkv
2024/01/02 02:40:35 [nvp.nvp_compiler] INFO: MSVC root dir is: D:\Softs\VisualStudio\VS2022
2024/01/02 02:40:35 [nvp.nvp_compiler] INFO: Found msvc-14.34.31933
2024/01/02 02:40:35 [nvp.core.build_manager] INFO: Selecting compiler msvc-14.34.31933 (in D:\Softs\VisualStudio\VS2022)
Traceback (most recent call last):
  File "D:\Projects\NervProj\nvp\media\movie_handler.py", line 14, in <module>
    from mtcnn import MTCNN
  File "D:\Projects\NervProj\.pyenvs\media_env\lib\site-packages\mtcnn\__init__.py", line 26, in <module>
    from mtcnn.mtcnn import MTCNN
  File "D:\Projects\NervProj\.pyenvs\media_env\lib\site-packages\mtcnn\mtcnn.py", line 37, in <module>
    from mtcnn.network.factory import NetworkFactory
  File "D:\Projects\NervProj\.pyenvs\media_env\lib\site-packages\mtcnn\network\factory.py", line 26, in <module>
    from tensorflow.keras.layers import Input, Dense, Conv2D, MaxPooling2D, PReLU, Flatten, Softmax
ModuleNotFoundError: No module named 'tensorflow'
2024/01/02 02:40:36 [nvp.nvp_object] ERROR: Subprocess terminated with error code 1 (cmd=['D:\\Projects\\NervProj\\.pyenvs\\media_env\\python.exe', 'D:\\Projects\\NervProj/nvp/media/movie_handler.py', 'process-webcam-view', '-i', 'webcam_test.mkv'])
2024/01/02 02:40:36 [nvp.components.runner] ERROR: Error occured in script command:
cmd=['D:\\Projects\\NervProj\\.pyenvs\\media_env\\python.exe', 'D:\\Projects\\NervProj/nvp/media/movie_handler.py', 'process-webcam-view', '-i', 'webcam_test.mkv']
cwd=Z:\perso\youtube\videos\projects\04_atm_transmittance_unit_test_part2\rushes
return code=1
lastest outputs:
Traceback (most recent call last):
  File "D:\Projects\NervProj\nvp\media\movie_handler.py", line 14, in <module>
    from mtcnn import MTCNN
  File "D:\Projects\NervProj\.pyenvs\media_env\lib\site-packages\mtcnn\__init__.py", line 26, in <module>
    from mtcnn.mtcnn import MTCNN
  File "D:\Projects\NervProj\.pyenvs\media_env\lib\site-packages\mtcnn\mtcnn.py", line 37, in <module>
    from mtcnn.network.factory import NetworkFactory
  File "D:\Projects\NervProj\.pyenvs\media_env\lib\site-packages\mtcnn\network\factory.py", line 26, in <module>
    from tensorflow.keras.layers import Input, Dense, Conv2D, MaxPooling2D, PReLU, Flatten, Softmax
ModuleNotFoundError: No module named 'tensorflow'

So I tried to also install the tensorflow[and-cuda] package in the media_env environment:

  media_env:
    inherit: default_env
    packages:
      - moviepy
      - Pillow
      - ffmpeg-python
      - opencv-python
      - rembg[gpu]
      - scipy
      - drawsvg[all]
      - hachoir
      - pyPDF2
      - mtcnn
      - tensorflow[and-cuda]
    additional_modules:
      cairo.dll: http://files.nervtech.org/nvp_packages/modules/cairo.dll

As a reference on how to install tensorflow, I used this page: https://www.tensorflow.org/install/pip?hl=en

⇒ Installation will take some time due to the download of the CUDA libraries:

Collecting tensorflow[and-cuda] (from -r D:\Projects\NervProj\.pyenvs\media_env\requirements.txt (line 15))
  Obtaining dependency information for tensorflow[and-cuda] from https://files.pythonhosted.org/packages/df/84/0a67b7ad368b597fa4fc60e2ae2f0fbe9c527c6fe5dbf290236a459fe4a6/tensorflow-2.14.1-cp310-cp310-win_amd64.whl.metadata
  Downloading tensorflow-2.14.1-cp310-cp310-win_amd64.whl.metadata (3.3 kB)
Collecting tensorflow-intel==2.14.1 (from tensorflow[and-cuda]->-r D:\Projects\NervProj\.pyenvs\media_env\requirements.txt (line 15))
  Obtaining dependency information for tensorflow-intel==2.14.1 from https://files.pythonhosted.org/packages/00/6a/de5fcbab0e4e68142a2e5499f672b05c38e65ccc20ad4508f05bfaac78ea/tensorflow_intel-2.14.1-cp310-cp310-win_amd64.whl.metadata
  Downloading tensorflow_intel-2.14.1-cp310-cp310-win_amd64.whl.metadata (4.8 kB)
Collecting nvidia-cublas-cu11==11.11.3.6 (from tensorflow[and-cuda]->-r D:\Projects\NervProj\.pyenvs\media_env\requirements.txt (line 15))
  Downloading nvidia_cublas_cu11-11.11.3.6-py3-none-win_amd64.whl (427.2 MB)

After that update, the nvp handle-camview -i webcam_test.mkv command works, but it takes ages on each frame 🤕!

1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 123ms/step

1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 87ms/step

1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 33ms/step

(... many similar lines per frame ...)

1/5 [=====>........................] - ETA: 0s
5/5 [==============================] - 0s 4ms/step

1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 78ms/step
t:   0%|          | 49/39254 [00:52<12:24:14,  1.14s/it, now=None]

(... same pattern repeats for every frame ...)

t:   0%|          | 50/39254 [00:53<12:20:06,  1.13s/it, now=None]

Note: I'm now using a dedicated folder to store this project's temp files: Z:\dev\data\the_lone_engineer\001_face_centering

⇒ ref: webcam_test_centered_v0.mp4

The implementation above is way too slow, so we need to find a way to improve performance.

Here is one suggestion using multiple threads:

import cv2
from moviepy.editor import VideoFileClip, ImageSequenceClip
from mtcnn.mtcnn import MTCNN
import concurrent.futures

def detect_faces(frame):
    detector = MTCNN()
    faces = detector.detect_faces(frame)
    
    if faces:
        return faces[0]['box']
    else:
        return None

def process_frame(frame):
    face_coordinates = detect_faces(frame)
    
    if face_coordinates is not None:
        x, y, w, h = face_coordinates
        center_x, center_y = x + w // 2, y + h // 2
        
        roi_start_x = max(center_x - w // 2, 0)
        roi_start_y = max(center_y - h // 2, 0)
        roi_end_x = min(center_x + w // 2, frame.shape[1])
        roi_end_y = min(center_y + h // 2, frame.shape[0])
        
        cropped_frame = frame[roi_start_y:roi_end_y, roi_start_x:roi_end_x]
        resized_frame = cv2.resize(cropped_frame, (frame.shape[1], frame.shape[0]))
        
        return resized_frame
    else:
        return frame

def process_video(video_path, output_path):
    video_clip = VideoFileClip(video_path)
    # Warning: this extracts every frame of the clip in memory at once
    frames = [frame for frame in video_clip.iter_frames()]

    with concurrent.futures.ThreadPoolExecutor() as executor:
        processed_frames = list(executor.map(process_frame, frames))

    # Rebuild a clip from the processed frames, keeping the original audio
    processed_clip = ImageSequenceClip(processed_frames, fps=video_clip.fps)
    processed_clip = processed_clip.set_audio(video_clip.audio)
    processed_clip.write_videofile(output_path, audio=True)

Unfortunately, this option is not really workable, as it requires extracting all the frames from the video clip as NumPy arrays: on my side this resulted in the Python process using more than 20GB of RAM and freezing my system.

⇒ I really think I need to process the frames one by one to extract the face position, but the key is maybe to not run the detection on every single frame: instead I could process 1 frame every 60 or even 120 frames (e.g. every second or every 2 seconds at 60 fps).

Here is the updated function I wrote to compute the face position only once every 60 frames:

    def process_frame(self, frame):
        """Function to process each frame of the video"""
        logger.info("Processing frame %d", self.frame_index)

        # Check if we should detect the face or not:
        if self.frame_index % 60 == 0:
            face_coordinates = self.detect_faces(frame)

            if face_coordinates is not None:
                x, y, w, h = face_coordinates
                center_x, center_y = x + w // 2, y + h // 2

                self.target_face_cx = center_x
                self.target_face_cy = center_y

                # Init the face coords if needed:
                if self.face_cx is None:
                    self.face_cx = center_x
                if self.face_cy is None:
                    self.face_cy = center_y

        if self.target_face_cx is not None:
            self.face_cx += (self.target_face_cx - self.face_cx) * 0.1
            self.face_cy += (self.target_face_cy - self.face_cy) * 0.1

        # Define the region of interest (ROI) around the detected face
        hsize = (self.frame_size * 3) // 2

        fcx = self.face_cx or frame.shape[1] // 2
        fcy = self.face_cy or frame.shape[0] // 2

        cx = min(max(fcx, hsize), frame.shape[1] - hsize)
        cy = min(max(fcy, hsize), frame.shape[0] - hsize)

        roi_start_x = int(cx - hsize)
        roi_start_y = int(cy - hsize)
        roi_end_x = int(cx + hsize)
        roi_end_y = int(cy + hsize)

        # Crop and resize the video around the detected face
        cropped_frame = frame[roi_start_y:roi_end_y, roi_start_x:roi_end_x]
        resized_frame = cv2.resize(cropped_frame, (self.frame_size, self.frame_size))

        self.frame_index += 1

        return resized_frame

For that first working version I used a target frame size of 256px with a source area of 256×3 pixels; I should make this configurable instead.
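
A possible (hypothetical, not implemented yet) way to make this configurable would be to expose the two sizes as attributes on the handler, e.g.:

        # Hypothetical configuration attributes, e.g. filled from command-line flags:
        self.frame_size = 256    # target output size in pixels
        self.source_scale = 3.0  # source crop area = frame_size * source_scale

        # process_frame() would then compute the half-size as:
        hsize = int(self.frame_size * self.source_scale) // 2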

⇒ ref: webcam_test_centered_v1.mp4

And also, for the interpolation, I should use a windowed mean instead.

Here is the WindowedMean class implementation:

"""WindowedMean class definition"""
from collections import deque


class WindowedMean:
    """
    Class for efficiently computing a windowed mean on the last N values from a measurement.

    Attributes:
        window_size (int): The size of the window for computing the mean.
        values (deque): A deque to store the last N measurements.
        sum (float): The sum of the values in the window.

    Methods:
        add_value(value): Adds a new value to the window and updates the mean.
        get_mean(): Calculates and returns the mean of the values in the window.
    """

    def __init__(self, window_size):
        """
        Initialize the WindowedMean object.

        Args:
            window_size (int): The size of the window for computing the mean.
        """
        self.window_size = window_size
        self.values = deque(maxlen=window_size)
        self.sum = 0.0

    def add_value(self, value):
        """
        Add a new value to the window and update the mean.

        Args:
            value: The new measurement value.
        """
        if len(self.values) == self.window_size:
            # Subtract the oldest value from the sum
            self.sum -= self.values[0]

        # Add the new value to the deque and the sum
        self.values.append(value)
        self.sum += value

    def get_mean(self):
        """
        Calculate and return the mean of the values in the window.

        Returns:
            float: The computed mean.
        """
        if not self.values:
            return None  # Return None if no values are available
        return self.sum / len(self.values)
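
A quick usage example:

wm = WindowedMean(3)
for val in (1.0, 2.0, 3.0, 4.0):
    wm.add_value(val)

# Only the last 3 values are kept: (2 + 3 + 4) / 3 = 3.0
print(wm.get_mean())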

And now using that for the face cx/cy interpolation:

        if self.target_face_cx is not None:
            self.face_cx.add_value(self.target_face_cx)
            self.face_cy.add_value(self.target_face_cy)

        # Define the region of interest (ROI) around the detected face
        hsize = (self.frame_size * 3) // 2

        fcx = self.face_cx.get_mean() or frame.shape[1] // 2
        fcy = self.face_cy.get_mean() or frame.shape[0] // 2
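
For this to work, face_cx and face_cy must of course now be initialized as WindowedMean instances, presumably something like this (the window size here is my guess):

        self.face_cx = WindowedMean(30)
        self.face_cy = WindowedMean(30)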

⇒ ref: webcam_test_centered_v2.mp4

To speed things up further, I then switched to the MTCNN implementation from the facenet_pytorch package, and installed the required packages in my media_env environment.

⇒ See this page for usage of the MTCNN component from the facenet_pytorch package: https://www.kaggle.com/code/timesler/guide-to-mtcnn-in-facenet-pytorch/notebook
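
Presumably this boils down to adding the facenet-pytorch package (which pulls in torch as a dependency) to the media_env package list, i.e. the equivalent of:

pip install facenet-pytorch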

Then I updated the detect_faces method:

    def detect_faces(self, frame):
        """Helper method to detect a face"""
        if self.face_detector is None:
            logger.info("Initializing MTCNN detector...")
            self.face_detector = MTCNN(
                device="cuda",
                select_largest=False,
                post_process=False,
            )
            # self.detect_face_func = capture_output(self.face_detector.detect_faces)
            # self.detect_face_func = capture_output(self.face_detector.detect)
            self.detect_face_func = self.face_detector.detect

        boxes, _ = self.detect_face_func(frame)
        # faces = self.detect_face_func(frame)

        # if faces:
        #     # Return only the first detected face
        #     return faces[0]["box"]
        if boxes is not None:
            return boxes[0]
        else:
            return None

And also the face coordinates computation:

            if face_coordinates is not None:
                # x, y, w, h = face_coordinates
                left, top, right, bottom = face_coordinates
                # center_x, center_y = x + w // 2, y + h // 2
                center_x, center_y = (left + right) / 2.0, (top + bottom) / 2.0

⇒ ref: webcam_test_centered_v3.mp4

This implementation works just as well as the previous version, and it is significantly faster (about 106 it/s vs 60 it/s).

The next point I want to cover is trying to cancel the interpolation delay we currently get, due to the fact that it takes 60 frames to converge to the correct center position.

But to achieve that, we will need to perform the face detection/extraction in 2 passes.

OK, I'm now preprocessing the frames to extract the face positions, and then using the interp1d function from SciPy to perform the interpolation:

    def process_frame(self, frame):
        """Function to process each frame of the video"""
        if self.frame_index % self.face_window_len == 0:
            logger.info("Processing frame %d", self.frame_index)

        # Perform interpolation to get the face x/y position:
        fcx = self.interp_fcx_func(self.frame_index)
        fcy = self.interp_fcy_func(self.frame_index)

        # Define the region of interest (ROI) around the detected face
        hsize = (self.frame_size * 3) // 2

        cx = min(max(fcx, hsize), frame.shape[1] - hsize)
        cy = min(max(fcy, hsize), frame.shape[0] - hsize)

        roi_start_x = int(cx - hsize)
        roi_start_y = int(cy - hsize)
        roi_end_x = int(cx + hsize)
        roi_end_y = int(cy + hsize)

        # Crop and resize the video around the detected face
        cropped_frame = frame[roi_start_y:roi_end_y, roi_start_x:roi_end_x]
        resized_frame = cv2.resize(cropped_frame, (self.frame_size, self.frame_size))

        self.frame_index += 1

        return resized_frame

    def collect_face_position(self, frame, nframes):
        """Collect the face position for a given frame"""
        if self.frame_index % self.face_window_len == 0:
            logger.info("Collecting face at frame %d/%d", self.frame_index, nframes)
            face_coordinates = self.detect_faces(frame)

            if face_coordinates is not None:
                # x, y, w, h = face_coordinates
                left, top, right, bottom = face_coordinates
                # center_x, center_y = x + w // 2, y + h // 2
                center_x, center_y = (left + right) / 2.0, (top + bottom) / 2.0

                self.frame_indices.append(self.frame_index)
                self.face_pos_x.append(center_x)
                self.face_pos_y.append(center_y)

        self.frame_index += 1

        return frame
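
For reference, the driver wiring these two passes together presumably looks something like this (a sketch of mine; the fill_value handling is an assumption):

from scipy.interpolate import interp1d

# First pass: collect the face positions on a subset of the frames
nframes = int(video_clip.fps * video_clip.duration)
self.frame_index = 0
for frame in video_clip.iter_frames():
    self.collect_face_position(frame, nframes)

# Build the interpolation functions from the collected positions
self.interp_fcx_func = interp1d(
    self.frame_indices, self.face_pos_x, kind="linear", fill_value="extrapolate"
)
self.interp_fcy_func = interp1d(
    self.frame_indices, self.face_pos_y, kind="linear", fill_value="extrapolate"
)

# Second pass: crop/resize each frame around the interpolated position
self.frame_index = 0
processed_clip = video_clip.fl_image(self.process_frame)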

⇒ ref: webcam_test2_centered_v4_180f.mp4 and webcam_test2_centered_v4_60f.mp4 and webcam_test2_centered_v4_60f_linear.mp4

Note: I tested both “cubic” and “linear” interpolations and the linear version seems a bit more appropriate to me 🤔.

The next point I would like to cover is the face scale control: currently we use a fixed scale of, for instance, 3 times the destination clip size. But I think it could be interesting to use a dynamic scaling process based on the size of the face in the source image: we could then dynamically change the size of the source area depending on the face size.

Added smooth interpolation for the frame source size:

        # Perform interpolation to get the face x/y position:
        cx = self.interp_fcx_func(self.frame_index)
        cy = self.interp_fcy_func(self.frame_index)
        fsize = self.interp_fsize_func(self.frame_index)
        # logger.info("source frame size: %f", fsize)

        # Define the region of interest (ROI) around the detected face
        # hsize = (self.frame_size * 3) // 2

        self.current_fsize += (fsize - self.current_fsize) * 0.003

        hsize = self.current_fsize * 3.0 / 2.0
        hsize = min(hsize, cx, cy, frame.shape[1] - cx, frame.shape[0] - cy)

        # cx = min(max(fcx, hsize), frame.shape[1] - hsize)
        # cy = min(max(fcy, hsize), frame.shape[0] - hsize)
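
Here, interp_fsize_func is built like the x/y interpolation functions, from a face size collected during the first pass; a sketch of what I mean (the exact size metric is my assumption):

        # In collect_face_position(), alongside the x/y positions:
        self.face_sizes.append(max(right - left, bottom - top))

        # And when building the interpolation functions:
        self.interp_fsize_func = interp1d(
            self.frame_indices, self.face_sizes, kind="linear", fill_value="extrapolate"
        )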

Note: I also tweaked the frame_window_size, the scaling adaptation factor, the source frame size factor, etc. quite a lot. Now I seem to have some appropriate settings ;-).

⇒ ref: webcam_test2_centered_v5.mp4

To write webm files we simply change the output file extension:

        output_path = self.set_path_extension(input_file, "_centered.webm")

And we change the codec:

        # Save the processed video
        # vcodec="libvpx-vp9"
        vcodec = "libvpx"
        processed_clip.write_videofile(
            output_path, audio=True, codec=vcodec, audio_codec="libvorbis", bitrate="700k", threads=4
        )

Note: libvpx-vp9 seems significantly slower than libvpx, so we stick to the latter for now.

Arrgghh… to support transparency, we need to specify the pixel format for ffmpeg:

        processed_clip.write_videofile(
            output_path,
            audio=True,
            codec=vcodec,
            audio_codec="libvorbis",
            bitrate="700k",
            threads=4,
            ffmpeg_params=["-pix_fmt", "yuva420p"],
        )

But, unfortunately, this will then generate an error while writing the output with the libvpx codec:

MoviePy error: FFMPEG encountered the following error while writing file webcam_test2_centered.webm:

 b'[libvpx @ 000001e590991c00] Transparency encoding with auto_alt_ref does not work\r\nError initializing output stream 0:0 -- Error while opening encoder for output stream #0:0 - maybe incorrect parameters such as bit_rate, rate, width or height\r\n'

⇒ To fix this issue it seems we really need to use the libvpx-vp9 codec.
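
That is, presumably the same write call as above but with the VP9 codec:

        processed_clip.write_videofile(
            output_path,
            audio=True,
            codec="libvpx-vp9",
            audio_codec="libvorbis",
            bitrate="700k",
            threads=4,
            ffmpeg_params=["-pix_fmt", "yuva420p"],
        )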

OK, unfortunately this doesn't work so well at playback. So I tried another codec (cf. https://stackoverflow.com/questions/75455066/convert-transparent-pngs-to-a-transparent-video)

Note: I was generating a circular mask as follows:

    # Function to create a circular mask
    def make_circular_mask(self, size, center, radius):
        """Method to generate a circular mask"""
        mask = Image.new("L", size, 0)
        draw = ImageDraw.Draw(mask)
        draw.ellipse((center[0] - radius, center[1] - radius, center[0] + radius, center[1] + radius), fill=1)
        return np.array(mask).astype(np.uint8)
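
Such a mask can then be attached to a clip with MoviePy's ImageClip (from moviepy.editor) used as a mask; roughly (a hedged sketch, the exact wiring in my handler may differ):

        # MoviePy masks expect values in [0, 1]; the 0/1 mask above fits that range
        mask_array = self.make_circular_mask(size, center, radius).astype(float)
        mask_clip = ImageClip(mask_array, ismask=True)
        processed_clip = processed_clip.set_mask(mask_clip)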

Tested the prores / prores_ks codecs with the “yuva444p10le” pixel format: not working at playback either.

Also checked that page: https://curiosalon.github.io/blog/ffmpeg-alpha-masking/

⇒ In the end, nothing seems to work as expected here, and I cannot generate a video with transparency at playback 🤕.
