
Quick project: filtering out corrupted gifs [python]

Hi guys! So here is another “quick project” article, this time about separating valid gif images from corrupted ones in a given folder. The context is simple: I've been downloading a collection of gifs as a torrent, but unfortunately that download is stuck at 92%, so I don't think I will ever get the full collection. Still, some of the files in there are already completely downloaded, so I should be able to try reading each of those gif files and keep only the valid ones 😉

As usual now, I'm going to build this as a new python utility in my NervProj framework. Let's get started!

  • Here is the initial minimal component class I created:
    """Module for GifHandler class definition"""
    
    import logging
    
    from PIL import Image
    import PIL
    
    from nvp.nvp_context import NVPContext
    from nvp.nvp_component import NVPComponent
    
    logger = logging.getLogger(__name__)
    
    
    class GifHandler(NVPComponent):
        """GifHandler component class"""
    
        def __init__(self, ctx: NVPContext):
            """class constructor"""
            NVPComponent.__init__(self, ctx)
    
        def process_command(self, cmd):
            """Check if this component can process the given command"""
    
            if cmd == 'filter-valid':
                return self.filter_valid_files()
    
            return False
    
        def filter_valid_files(self):
            """Filter the valid gif files from a given folder"""
    
            # Should perform the filtering here.
            return True
    
    if __name__ == "__main__":
        # Create the context:
        context = NVPContext()
    
        # Add our component:
        comp = context.register_component("gifhandler", GifHandler(context))
    
        context.define_subparsers("main", {
            'filter-valid': None,
        })
    
        psr = context.get_parser('main.filter-valid')
        psr.add_argument("--output", dest="output_dir", type=str,
                         help="Output dir where to store the valid files")
        psr.add_argument("--input", dest="input_dir", type=str,
                         help="Input dir where to start the filtering")
    
        comp.run()
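
For readers not using NervProj, the command line defined above can be approximated with plain argparse. This is only a minimal sketch: I'm assuming here that `define_subparsers` and `get_parser` behave like thin wrappers around argparse sub-parsers, and the `build_parser` name and the `gifs` program name are made up for the illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing a single 'filter-valid' sub-command.

    Hypothetical stand-in for the NervProj parser setup, not its actual code.
    """
    parser = argparse.ArgumentParser(prog="gifs")
    # The selected sub-command name ends up in args.cmd:
    subparsers = parser.add_subparsers(dest="cmd", required=True)

    psr = subparsers.add_parser("filter-valid",
                                help="Keep only the readable gif files")
    # Same two optional arguments as in the component above:
    psr.add_argument("--output", dest="output_dir", type=str,
                     help="Output dir where to store the valid files")
    psr.add_argument("--input", dest="input_dir", type=str,
                     help="Input dir where to start the filtering")
    return parser
```

Calling `build_parser().parse_args(["filter-valid"])` leaves both `input_dir` and `output_dir` set to `None`, which is the case where default folders have to be picked.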
    
  • Then of course I defined a new script, and added a new dedicated python env for these kinds of “media handling” tools:
      "custom_python_envs": {
        "defi_env": {
          "packages": ["requests", "jstyleson", "xxhash", "numpy", "psycopg2"]
        },
        "media_env": {
          "packages": [
            "requests",
            "jstyleson",
            "xxhash",
            "moviepy",
            "pillow",
            "ffmpeg-python",
            "opencv-python"
          ]
        }
      },
    
      "scripts": {
        // Update the coingecko prices
        "coingecko": {
          "custom_python_env": "defi_env",
          "cmd": "${PYTHON} nvh/defi/coingecko.py",
          "cwd": "${PROJECT_ROOT_DIR}",
          "python_path": ["${PROJECT_ROOT_DIR}", "${NVP_ROOT_DIR}"]
        },
        "gifs": {
          "custom_python_env": "media_env",
          "cmd": "${PYTHON} ${PROJECT_ROOT_DIR}/nvh/media/gif_handler.py",
          "python_path": ["${PROJECT_ROOT_DIR}", "${NVP_ROOT_DIR}"]
        }
      }
Note that I'm not changing the CWD for the “gifs” script above (hence the full script path in its cmd): I want to be able to run that script directly from inside the input folder I need to process.
  • The final step was to implement the filter_valid_files method properly, which was in fact pretty straightforward:
        def filter_valid_files(self):
            """Filter the valid gif files from a given folder"""
    
            input_dir = self.get_param("input_dir")
            if input_dir is None:
                # Use the current working dir:
                input_dir = self.get_cwd()
    
            output_dir = self.get_param("output_dir")
            if output_dir is None:
                # We use the parent folder of the input:
                folder = self.get_filename(input_dir)
                parent_dir = self.get_parent_folder(input_dir)
                output_dir = self.get_path(parent_dir, f"{folder}_filtered")
    
            # logger.info("Should filter the image from %s into %s", input_dir, output_dir)
            # List all the gif files recursively (raw string to avoid an invalid "\." escape):
            all_files = self.get_all_files(input_dir, exp=r"\.gif", recursive=True)
            num_imgs = len(all_files)
            logger.info("Collected %d gif files", num_imgs)
    
            # Create the destination dir:
            self.make_folder(output_dir)
    
            valid_count = 0
            # Iterate on each file:
            for i, fname in enumerate(all_files):
                src_file = self.get_path(input_dir, fname)
                # Try to open that file:
                try:
                    # The context manager closes the file even when we skip it,
                    # so it is never left open before the move below:
                    with PIL.Image.open(src_file) as img:
                        # Should have more than 1 frame:
                        nframes = getattr(img, 'n_frames', 1)

                    if nframes <= 1:
                        logger.info("%d/%d: Not enough frames in %s", i+1, num_imgs, fname)
                        continue

                    valid_count += 1

                    # Move the image to the output folder:
                    dest_file = self.get_path(output_dir, fname)
                    self.rename_file(src_file, dest_file, True)

                except (PIL.UnidentifiedImageError, PIL.Image.DecompressionBombError):
                    logger.info("%d/%d: Cannot open file %s", i+1, num_imgs, fname)
                    continue

            # Guard against an empty folder to avoid a division by zero:
            ratio = valid_count*100.0/num_imgs if num_imgs > 0 else 0.0
            logger.info("Filtered %d valid images (ie %.3f%%)", valid_count, ratio)
    
            return True
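
As a side note, the same "is this gif complete and animated?" question can also be answered without Pillow by walking the block structure of the file. This is not what the method above does: it's a rough stdlib-only sketch of an alternative check, and the names `count_gif_frames`, `is_complete_animation` and `TruncatedGif` are hypothetical. It doesn't decode any pixel data, so it only catches files that end early or contain an unexpected block marker, not damage inside a frame.

```python
from pathlib import Path


class TruncatedGif(Exception):
    """Raised when the data ends before the GIF trailer byte (hypothetical name)."""


def count_gif_frames(data: bytes) -> int:
    """Walk the GIF block structure and return the number of image frames."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")

    pos = 6

    def take(n: int) -> bytes:
        # Read exactly n bytes or report a truncated file.
        nonlocal pos
        chunk = data[pos:pos + n]
        if len(chunk) < n:
            raise TruncatedGif(f"unexpected end of data at offset {pos}")
        pos += n
        return chunk

    def skip_sub_blocks() -> None:
        # Data sub-blocks: a size byte followed by that many bytes, ended by size 0.
        while True:
            size = take(1)[0]
            if size == 0:
                return
            take(size)

    def skip_color_table(packed: int) -> None:
        # Bit 7 flags a color table; the low 3 bits n give 2^(n+1) RGB entries.
        if packed & 0x80:
            take(3 * (2 << (packed & 0x07)))

    # Logical screen descriptor: width, height, packed flags, background, aspect.
    screen = take(7)
    skip_color_table(screen[4])

    frames = 0
    while True:
        marker = take(1)[0]
        if marker == 0x3B:      # trailer: the file is structurally complete
            return frames
        if marker == 0x21:      # extension: one label byte, then sub-blocks
            take(1)
            skip_sub_blocks()
        elif marker == 0x2C:    # image descriptor: one frame
            descriptor = take(9)
            skip_color_table(descriptor[8])
            take(1)             # LZW minimum code size
            skip_sub_blocks()
            frames += 1
        else:
            raise ValueError(f"unknown block marker 0x{marker:02x}")


def is_complete_animation(path) -> bool:
    """Return True if the file parses up to its trailer and has more than 1 frame."""
    try:
        return count_gif_frames(Path(path).read_bytes()) > 1
    except (ValueError, TruncatedGif):
        return False
```

A check like this could serve as a cheap pre-filter, but opening the file with Pillow as done above remains the simpler option when the library is already installed.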
  • ⇒ And that's it already: I just need to move into the root folder containing all the gif images and run the command:
    nvp run gifs filter-valid
  • This will produce a sibling folder with the suffix “_filtered” containing all the valid gif files, while the invalid files will remain in the original folder 👍!
I didn't even bother using the command line arguments '--input' and '--output' I defined above: the default behavior is fine for my use case.
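
To make that default behavior concrete, here is a minimal standalone sketch of the "sibling folder" logic using only pathlib and shutil instead of the NervProj helpers. The `move_valid_files` function and its `is_valid` callback are hypothetical names for this illustration; the callback stands in for whatever validity check is used (the Pillow frame count in my case).

```python
import shutil
from pathlib import Path
from typing import Callable, Optional


def move_valid_files(input_dir: Path, is_valid: Callable[[Path], bool],
                     output_dir: Optional[Path] = None) -> int:
    """Move the gif files accepted by is_valid, keeping the sub-folder layout.

    Returns the number of files moved. Rejected files stay where they are.
    """
    input_dir = Path(input_dir).resolve()
    if output_dir is None:
        # Default: sibling folder named "<input>_filtered"
        output_dir = input_dir.with_name(f"{input_dir.name}_filtered")

    moved = 0
    # sorted() collects the full list first, so moving files does not
    # disturb the directory walk. Note: "*.gif" is case-sensitive on Linux.
    for src in sorted(input_dir.rglob("*.gif")):
        if not src.is_file() or not is_valid(src):
            continue
        dest = Path(output_dir) / src.relative_to(input_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest))
        moved += 1
    return moved
```

For instance, `move_valid_files(Path.cwd(), my_check)` run from inside a folder named `gifs` would move the accepted files into `../gifs_filtered`, where `my_check` is any function taking a path and returning a bool.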
  • ⇒ So for once, this was really a quick project, good good 😂
  • Last modified: 2022/05/10 20:02