====== Quick project: filtering out corrupted gifs [python] ======

{{tag>dev python nervproj}}

Hi guys! So here is another "quick project" article, this time about separating valid gif images from corrupted ones in a given folder. The context is simple: I've been downloading a collection of gifs as a torrent, but unfortunately that download is stuck at 92%, so I don't think I will ever get the full collection. Still, some of the files in there are already completely downloaded, so technically I should be able to try to read each of those gif files and keep only the valid ones 😉 As usual now, I'm going to build this as a new python utility in my NervProj framework, let's get started!

====== ======

===== Preparing the initial version of the utility component =====

  * Here is the initial minimal component class I created:

<code python>
"""Module for GifHandler class definition"""

import logging

from PIL import Image
import PIL

from nvp.nvp_context import NVPContext
from nvp.nvp_component import NVPComponent

logger = logging.getLogger(__name__)


class GifHandler(NVPComponent):
    """GifHandler component class"""

    def __init__(self, ctx: NVPContext):
        """class constructor"""
        NVPComponent.__init__(self, ctx)

    def process_command(self, cmd):
        """Check if this component can process the given command"""

        if cmd == 'filter-valid':
            return self.filter_valid_files()

        return False

    def filter_valid_files(self):
        """Filter the valid gif files from a given folder"""

        # Should perform the filtering here.
        return True


if __name__ == "__main__":
    # Create the context:
    context = NVPContext()

    # Add our component:
    comp = context.register_component("gifhandler", GifHandler(context))

    context.define_subparsers("main", {
        'filter-valid': None,
    })

    psr = context.get_parser('main.filter-valid')
    psr.add_argument("--output", dest="output_dir", type=str,
                     help="Output dir where to store the valid files")
    psr.add_argument("--input", dest="input_dir", type=str,
                     help="Input dir where to start the filtering")

    comp.run()
</code>

  * Then of course I defined a new script, and added a new dedicated python env for this kind of "media handling" tools:

<code json>
  "custom_python_envs": {
    "defi_env": {
      "packages": ["requests", "jstyleson", "xxhash", "numpy", "psycopg2"]
    },
    "media_env": {
      "packages": [
        "requests",
        "jstyleson",
        "xxhash",
        "moviepy",
        "pillow",
        "ffmpeg-python",
        "opencv-python"
      ]
    }
  },
  "scripts": {
    // Update the coingecko prices
    "coingecko": {
      "custom_python_env": "defi_env",
      "cmd": "${PYTHON} nvh/defi/coingecko.py",
      "cwd": "${PROJECT_ROOT_DIR}",
      "python_path": ["${PROJECT_ROOT_DIR}", "${NVP_ROOT_DIR}"]
    },
    "gifs": {
      "custom_python_env": "media_env",
      "cmd": "${PYTHON} ${PROJECT_ROOT_DIR}/nvh/media/gif_handler.py",
      "python_path": ["${PROJECT_ROOT_DIR}", "${NVP_ROOT_DIR}"]
    }
  }
</code>

Note that I'm **not** changing the CWD for the "gifs" script above: I want to be able to run that script directly from inside the input folder I need to process.
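To illustrate why leaving the CWD untouched matters, here is a minimal standalone sketch of the defaulting logic this setup allows: when ''--input'' is omitted the current working directory is used, and when ''--output'' is omitted a sibling folder with the "_filtered" suffix is derived. This uses plain ''argparse'' and ''pathlib'' rather than the NervProj helpers, and the names ''build_parser'' and ''resolve_dirs'' are hypothetical, introduced only for this illustration.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence, Tuple


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the same two options as the 'filter-valid' command."""
    parser = argparse.ArgumentParser(description="Filter valid gif files")
    parser.add_argument("--output", dest="output_dir", type=str, default=None,
                        help="Output dir where to store the valid files")
    parser.add_argument("--input", dest="input_dir", type=str, default=None,
                        help="Input dir where to start the filtering")
    return parser


def resolve_dirs(argv: Sequence[str], cwd: Optional[Path] = None) -> Tuple[Path, Path]:
    """Return (input_dir, output_dir), falling back to CWD-based defaults.

    Hypothetical helper: 'cwd' can be injected to make the function testable,
    otherwise the real current working directory is used.
    """
    args = build_parser().parse_args(argv)
    base = Path(cwd) if cwd is not None else Path.cwd()

    # No --input given: process the folder the command was started from.
    input_dir = Path(args.input_dir) if args.input_dir else base

    # No --output given: use a sibling folder named "<input>_filtered".
    if args.output_dir:
        output_dir = Path(args.output_dir)
    else:
        output_dir = input_dir.parent / f"{input_dir.name}_filtered"

    return input_dir, output_dir
```

With this kind of fallback, running the command from inside a folder called ''gifs'' would target ''gifs'' as input and ''gifs_filtered'' next to it as output, without any argument on the command line.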
===== Writing the main function =====

  * The final step was to implement the ''filter_valid_files'' method correctly, which was in fact pretty straightforward:

<code python>
    def filter_valid_files(self):
        """Filter the valid gif files from a given folder"""

        input_dir = self.get_param("input_dir")
        if input_dir is None:
            # Use the current working dir:
            input_dir = self.get_cwd()

        output_dir = self.get_param("output_dir")
        if output_dir is None:
            # We use the parent folder of the input:
            folder = self.get_filename(input_dir)
            parent_dir = self.get_parent_folder(input_dir)
            output_dir = self.get_path(parent_dir, f"{folder}_filtered")

        # logger.info("Should filter the image from %s into %s", input_dir, output_dir)

        # List all the gif files recursively (raw string keeps the regex escape valid):
        all_files = self.get_all_files(input_dir, exp=r"\.gif", recursive=True)

        num_imgs = len(all_files)
        logger.info("Collected %d gif files", num_imgs)
        if num_imgs == 0:
            # Nothing to filter, and this avoids a division by zero below:
            return True

        # Create the destination dir:
        self.make_folder(output_dir)

        valid_count = 0

        # Iterate on each file:
        for i, fname in enumerate(all_files):
            src_file = self.get_path(input_dir, fname)

            # Try to open that file:
            try:
                # The context manager closes the file on every path,
                # including the "not enough frames" case below:
                with PIL.Image.open(src_file) as img:
                    nframes = getattr(img, 'n_frames', 1)
            except (PIL.UnidentifiedImageError, PIL.Image.DecompressionBombError):
                logger.info("%d/%d: Cannot open file %s", i+1, num_imgs, fname)
                continue

            # Should have more than 1 frame:
            if nframes <= 1:
                logger.info("%d/%d: Not enough frames in %s", i+1, num_imgs, fname)
                continue

            valid_count += 1

            # Move the image to the output folder:
            dest_file = self.get_path(output_dir, fname)
            self.rename_file(src_file, dest_file, True)

        logger.info("Filtered %d valid images (ie %.3f%%)", valid_count, valid_count*100.0/num_imgs)

        return True
</code>

  * => And this is it already: I just need to move into the root folder with all the gif images, and run the command:

<code bash>
nvp run gifs filter-valid
</code>

  * This will produce a sibling folder with the suffix "_filtered" containing all the valid gif files, while the invalid files will remain in the original folder 👍! I didn't even bother using the command line arguments ''--input'' and ''--output'' I defined above: the default behavior is OK for my usage.
  * => So for once, this was really a quick project, good good 😂
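As a side note, the filtering above relies on Pillow being able to open each file and report more than one frame. A complementary idea, which is **not** what this project uses, is a stricter structural check written with the standard library only: walk the GIF block structure from the header to the trailer byte and count the image descriptors on the way. The sketch below is an illustration under stated assumptions: the names ''GifFormatError'', ''count_gif_frames'' and ''is_valid_animated_gif'' are hypothetical, the pixel data is not decoded at all, and requiring the trailer is a deliberately strict choice (some decoders accept files without it).

```python
from pathlib import Path


class GifFormatError(ValueError):
    """Raised when the GIF byte stream is malformed or ends too early."""


def _color_table_size(packed: int) -> int:
    """Byte size of the color table announced by a packed field (0 if absent)."""
    return 3 * (2 ** ((packed & 0x07) + 1)) if packed & 0x80 else 0


def _skip_sub_blocks(data: bytes, pos: int) -> int:
    """Skip a chain of data sub-blocks, return the offset after its terminator."""
    while True:
        if pos >= len(data):
            raise GifFormatError("truncated inside a sub-block chain")
        size = data[pos]
        pos += 1
        if size == 0:  # zero-length sub-block terminates the chain
            return pos
        pos += size
        if pos > len(data):
            raise GifFormatError("truncated inside a sub-block")


def count_gif_frames(data: bytes) -> int:
    """Walk the GIF block structure and return the number of image frames."""
    # 6-byte signature followed by the 7-byte logical screen descriptor.
    if len(data) < 13 or data[:6] not in (b"GIF87a", b"GIF89a"):
        raise GifFormatError("missing GIF header")
    pos = 13 + _color_table_size(data[10])  # skip the global color table, if any

    frames = 0
    while True:
        if pos >= len(data):
            raise GifFormatError("no trailer found: data ends too early")
        block = data[pos]
        pos += 1
        if block == 0x3B:  # trailer: end of the GIF stream
            return frames
        if block == 0x21:  # extension: one label byte, then sub-blocks
            pos = _skip_sub_blocks(data, pos + 1)
        elif block == 0x2C:  # image descriptor: 9 bytes after the separator
            if pos + 9 > len(data):
                raise GifFormatError("truncated image descriptor")
            pos += 9
            pos += _color_table_size(data[pos - 1])  # optional local color table
            pos = _skip_sub_blocks(data, pos + 1)  # +1 skips the LZW code size byte
            frames += 1
        else:
            raise GifFormatError(f"unexpected block type 0x{block:02x}")


def is_valid_animated_gif(path) -> bool:
    """Return True if the file is structurally complete and has several frames."""
    try:
        return count_gif_frames(Path(path).read_bytes()) > 1
    except (OSError, GifFormatError):
        return False
```

A check like ''is_valid_animated_gif(src_file)'' could in principle be run before moving a file, but since it never decodes the image data it can only detect structural damage (missing header, cut-off blocks, missing trailer), not corrupted pixels inside otherwise well-formed blocks.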