====== Quick project: filtering out corrupted gifs [python] ======
{{tag>dev python nervproj}}
Hi guys! So here is another "quick project" article, this time about separating valid gif images from corrupted ones in a given folder. The context is simple: I've been downloading a collection of gifs as a torrent, but unfortunately that download is stuck at 92%, so I don't think I will ever get the full collection. Still, some of the files in there are already completely downloaded, so I should be able to try reading each of those gif files and only keep the valid ones 😉
As usual now, I'm going to build this as a new python utility in my NervProj framework, let's get started!
====== ======
===== Preparing the initial version of the utility component =====
* Here is the initial minimal component class I created: <code python>
"""Module for GifHandler class definition"""
import logging

# Importing the Image submodule also makes PIL.Image available below:
from PIL import Image
import PIL

from nvp.nvp_context import NVPContext
from nvp.nvp_component import NVPComponent

logger = logging.getLogger(__name__)


class GifHandler(NVPComponent):
    """GifHandler component class"""

    def __init__(self, ctx: NVPContext):
        """class constructor"""
        NVPComponent.__init__(self, ctx)

    def process_command(self, cmd):
        """Check if this component can process the given command"""
        if cmd == 'filter-valid':
            return self.filter_valid_files()

        return False

    def filter_valid_files(self):
        """Filter the valid gif files from a given folder"""
        # Should perform the filtering here.
        return True


if __name__ == "__main__":
    # Create the context:
    context = NVPContext()

    # Add our component:
    comp = context.register_component("gifhandler", GifHandler(context))

    context.define_subparsers("main", {
        'filter-valid': None,
    })

    psr = context.get_parser('main.filter-valid')
    psr.add_argument("--output", dest="output_dir", type=str,
                     help="Output dir where to store the valid files")
    psr.add_argument("--input", dest="input_dir", type=str,
                     help="Input dir where to start the filtering")

    comp.run()
</code>
* Then of course I defined a new script, and added a new dedicated python env for this kind of "media handling" tool: <code json>
"custom_python_envs": {
  "defi_env": {
    "packages": ["requests", "jstyleson", "xxhash", "numpy", "psycopg2"]
  },
  "media_env": {
    "packages": [
      "requests",
      "jstyleson",
      "xxhash",
      "moviepy",
      "pillow",
      "ffmpeg-python",
      "opencv-python"
    ]
  }
},
"scripts": {
  // Update the coingecko prices
  "coingecko": {
    "custom_python_env": "defi_env",
    "cmd": "${PYTHON} nvh/defi/coingecko.py",
    "cwd": "${PROJECT_ROOT_DIR}",
    "python_path": ["${PROJECT_ROOT_DIR}", "${NVP_ROOT_DIR}"]
  },
  "gifs": {
    "custom_python_env": "media_env",
    "cmd": "${PYTHON} ${PROJECT_ROOT_DIR}/nvh/media/gif_handler.py",
    "python_path": ["${PROJECT_ROOT_DIR}", "${NVP_ROOT_DIR}"]
  }
}
</code>
Note that I'm **not** changing the CWD for the "gifs" script above: I want to be able to run that script directly from inside whichever input folder I need to process.
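For readers who don't use NervProj, here is a rough sketch of how the same command line could be declared with the standard ''argparse'' module. I'm only assuming that ''define_subparsers'' / ''get_parser'' play a similar role in the framework; the ''build_parser'' helper and the ''prog'' name are made up for this illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing a single 'filter-valid' sub-command.

    Hypothetical stand-in for the NervProj parser setup, not its actual code.
    """
    parser = argparse.ArgumentParser(prog="gifs")
    subparsers = parser.add_subparsers(dest="cmd", required=True)

    psr = subparsers.add_parser("filter-valid", help="Keep only the valid gif files")
    # Both options default to None, so the command can decide on fallbacks later
    # (for instance using the current working dir as input).
    psr.add_argument("--output", dest="output_dir", type=str, default=None,
                     help="Output dir where to store the valid files")
    psr.add_argument("--input", dest="input_dir", type=str, default=None,
                     help="Input dir where to start the filtering")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.cmd, args.input_dir, args.output_dir)
```

Leaving both options at ''None'' by default is what allows the script to be started from any folder without passing any argument.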
===== Writing the main function =====
* The final step was to implement the ''filter_valid_files'' method correctly, which was in fact pretty straightforward: <code python>
    def filter_valid_files(self):
        """Filter the valid gif files from a given folder"""
        input_dir = self.get_param("input_dir")
        if input_dir is None:
            # Use the current working dir:
            input_dir = self.get_cwd()

        output_dir = self.get_param("output_dir")
        if output_dir is None:
            # We use a sibling of the input folder, with the "_filtered" suffix:
            folder = self.get_filename(input_dir)
            parent_dir = self.get_parent_folder(input_dir)
            output_dir = self.get_path(parent_dir, f"{folder}_filtered")

        # logger.info("Should filter the image from %s into %s", input_dir, output_dir)

        # List all the gif files recursively (raw string to keep the regex escape intact):
        all_files = self.get_all_files(input_dir, exp=r"\.gif", recursive=True)
        num_imgs = len(all_files)
        logger.info("Collected %d gif files", num_imgs)
        if num_imgs == 0:
            # Nothing to filter (this also avoids a division by zero below):
            return True

        # Create the destination dir:
        self.make_folder(output_dir)

        valid_count = 0

        # Iterate on each file:
        for i, fname in enumerate(all_files):
            src_file = self.get_path(input_dir, fname)

            # Try to open that file:
            try:
                # The context manager closes the file even when we skip it:
                with PIL.Image.open(src_file) as img:
                    nframes = getattr(img, 'n_frames', 1)
            except (PIL.UnidentifiedImageError, PIL.Image.DecompressionBombError):
                logger.info("%d/%d: Cannot open file %s", i+1, num_imgs, fname)
                continue

            # Should have more than 1 frame:
            if nframes <= 1:
                logger.info("%d/%d: Not enough frames in %s", i+1, num_imgs, fname)
                continue

            valid_count += 1

            # Move the valid image to the output folder:
            dest_file = self.get_path(output_dir, fname)
            self.rename_file(src_file, dest_file, True)

        logger.info("Filtered %d valid images (ie %.3f%%)", valid_count, valid_count*100.0/num_imgs)
        return True
</code>
* => And this is it already: I just need to move into the root folder containing all the gif images and run the command: <code>nvp run gifs filter-valid</code>
* This will produce a sibling folder with the suffix "_filtered" containing all the valid gif files, while the invalid files will remain in the original folder 👍!
I didn't even bother using the ''--input'' and ''--output'' command line arguments I defined above: the default behavior is fine for my use case.
* => So for once, this was really a quick project, good good 😂
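If you want to try the same idea without NervProj, here is a minimal standalone sketch using only ''pathlib'', ''shutil'' and Pillow. It is not the framework's implementation: the function names, the ''is_valid'' parameter and the ''*.gif'' glob (which is case sensitive on Linux) are my own assumptions for this example. The validity check mirrors the one from the article: the file must open with Pillow and report more than one frame.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional, Union


def is_animated_gif(path: Path) -> bool:
    """Return True if Pillow can open the file and reports more than one frame."""
    # Pillow is imported lazily so that the file-moving logic below stays
    # usable (and testable) with another predicate when Pillow is not installed.
    from PIL import Image, UnidentifiedImageError

    try:
        with Image.open(path) as img:
            return getattr(img, "n_frames", 1) > 1
    except (UnidentifiedImageError, Image.DecompressionBombError):
        return False


def filter_valid_files(input_dir: Union[str, Path],
                       output_dir: Optional[Union[str, Path]] = None,
                       is_valid: Callable[[Path], bool] = is_animated_gif) -> int:
    """Move the files accepted by is_valid into output_dir, return how many moved."""
    src_root = Path(input_dir).resolve()
    if output_dir is None:
        # Default: sibling folder of the input, with the "_filtered" suffix
        dst_root = src_root.with_name(src_root.name + "_filtered")
    else:
        dst_root = Path(output_dir)

    moved = 0
    # sorted() materialises the listing first, since files are moved while looping
    for src in sorted(src_root.rglob("*.gif")):
        if not src.is_file() or not is_valid(src):
            continue
        # Keep the same relative sub-folder layout in the destination
        dest = dst_root / src.relative_to(src_root)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest))
        moved += 1
    return moved


if __name__ == "__main__":
    count = filter_valid_files(Path.cwd())
    print(f"Moved {count} valid gif files")
```

Passing the check as a parameter keeps the "walk and move" part independent from the "is this file valid" part, so a stricter test (for instance decoding every frame) could be swapped in later without touching the rest.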