====== Quick project: generating multi-word passwords [python] ======

{{tag>dev python crypto}}

Hi guys! If you have touched the crypto/blockchain field already, then you know that very often the passwords generated to protect your wallets are built as a "collection of words" (English words are the only thing I've seen so far myself). For instance: "sleep score my opposite migrate", really just random words.

I think this concept is quite interesting as a generic mechanism to create your passwords: they are still very strong, but it's easier to remember/use a few words than a completely random collection of letters/digits/symbols 😊, especially when those words are in your mother tongue (so French for me).

=> So today, let's just build a small Python utility to do just that: generate this kind of multi-word password.

===== The command line arguments =====

The way I see it, I could use the following command line arguments for that tool:

  * **language**: A language code specifying the language from which the words should be taken: English or French for instance; it should probably default to French in my case.
  * **num_words**: Number of words to pick.
  * **min_len**: Minimum number of characters in the words, defaulting to 3.
  * **max_len**: Maximum number of characters in the words, defaulting to 8.
  * **no_space**: Remove the space between the words, defaulting to false.

And that should be enough to generate our passwords with a command line such as:

<code>
nvp gen-password
</code>

Let's prepare the skeleton as a new NVPComponent:

<code python>
"""PasswordGenerator module"""
import logging

from nvp.nvp_component import NVPComponent
from nvp.nvp_context import NVPContext

logger = logging.getLogger(__name__)


class PasswordGenerator(NVPComponent):
    """PasswordGenerator component class"""

    def __init__(self, ctx: NVPContext, _proj=None):
        """Component constructor"""
        NVPComponent.__init__(self, ctx)

        desc = {
            "gen-password": None,
        }
        ctx.define_subparsers("main", desc)

        psr = ctx.get_parser('main.gen-password')
        psr.add_argument("-l", "--language", dest="language", type=str, default="fr",
                         help="Input language to use to collect the words")
        psr.add_argument("-n", "--num", dest="num_words", type=int, default=5,
                         help="Number of words to collect")
        psr.add_argument("--min", dest="min_len", type=int, default=3,
                         help="Minimum number of characters in the words")
        psr.add_argument("--max", dest="max_len", type=int, default=8,
                         help="Maximum number of characters in the words")
        psr.add_argument("-p", "--no-space", dest="no_space", action='store_true',
                         help="Remove the space between the words")

    def process_command(self, cmd0):
        """Re-implementation of process_command"""
        if cmd0 == 'gen-password':
            return self.run()

        return False

    def run(self):
        """Generate a password given some input settings"""
        logger.info("Should generate password here.")
        return True
</code>

And this gives us the expected output when the command is run:

<code>
kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp gen-password
2022/04/25 07:22:44 [components.password_generator] INFO: Should generate password here.
</code>
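For readers who don't use my NVP framework: ''ctx.define_subparsers()'' and ''ctx.get_parser()'' appear to be thin wrappers around **argparse** sub-parsers, so the same interface could be declared in a standalone script roughly like this (just a sketch outside of the component, not part of the final code):

<code python>
import argparse

# Standalone equivalent of the 'gen-password' sub-command arguments:
parser = argparse.ArgumentParser(prog="gen-password",
                                 description="Generate a multi-word password")
parser.add_argument("-l", "--language", dest="language", type=str, default="fr",
                    help="Input language to use to collect the words")
parser.add_argument("-n", "--num", dest="num_words", type=int, default=5,
                    help="Number of words to pick")
parser.add_argument("--min", dest="min_len", type=int, default=3,
                    help="Minimum number of characters in the words")
parser.add_argument("--max", dest="max_len", type=int, default=8,
                    help="Maximum number of characters in the words")
parser.add_argument("-p", "--no-space", dest="no_space", action="store_true",
                    help="Remove the space between the words")

args = parser.parse_args()
print(args)
</code>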
===== Retrieving a collection of words =====

Okay, now that we can run the command, let's start thinking about the actual process:

  * We will need to select words randomly from a collection of words. This should be easily done with [[https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html|numpy.random.choice()]].
  * But first, we need a **list of words** to work with! And that, I don't have yet... so this requires some additional thinking/searching.
  * Once we have a list of words in the target language, we can easily pre-process that list to keep only the words with the correct number of characters, so that part is easy too.

So, let's find some words first 😊!

Okay, that was not too hard: we have for instance [[https://www.gutenberg.org/browse/languages/fr]] where we can download raw text versions of French books, so that is what I just looked at. And in fact, on that page you have a very large list of links in the format **https://www.gutenberg.org/ebooks/18812**, and for each of those links we can then download a text file at (for instance) **https://www.gutenberg.org/ebooks/18812.txt.utf-8** (or maybe not?)

=> So let's try to grab all those links, and then some of the text files! We will add a new command in our PasswordGenerator module for that, called ''collect-words'' (note that this also requires an ''import re'' at the top of the module):

<code python>
def collect_words(self):
    """Collecting words from text files."""
    # logger.info("Should collect text files here.")
    lang = self.get_param("language")

    # Get the text file links from gutenberg:
    url = f"https://www.gutenberg.org/browse/languages/{lang}"
    content = self.get_online_content(url)

    # Extract the urls:
    # https://www.gutenberg.org/ebooks/18812
    book_desc = re.findall(r"href=\"(/ebooks/[0-9]+)\">(.+)", content)
    for elem in book_desc:
        logger.info("%s : '%s'", elem[0], elem[1])

    logger.info("Found %d books", len(book_desc))
    return True
</code>

And with that code, I can find more than 4000 French books, so that should be far more than enough 😅:

<code>
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/8561 : 'Une page d'amour'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/34451 : 'Paris'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/8907 : 'Pot-Bouille'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/17533 : 'Le Rêve'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/34528 : 'Rome'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/17557 : 'Son Excellence Eugène Rougon'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/8563 : 'La Terre'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/7461 : 'Thérèse Raquin'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/6470 : 'Le Ventre de Paris'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/56808 : 'La vérité en marche: L'affaire Dreyfus'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/46447 : 'Ma confession'
2022/04/25 08:03:21 [components.password_generator] INFO: Found 4611 books
</code>

=> For now let's just retrieve a few of those books, let's say 20, randomly selected.

In fact this is giving me another idea: since I can collect so much text data in French, could this be used as an input to a deep learning network to write texts in French automatically? => To be investigated further some day, if I have the time for it.
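''get_online_content()'' above is another helper from the NVP framework; if you want to reproduce just that scraping step outside of it, a minimal standalone sketch could look like this (assuming the **requests** package is available):

<code python>
import re
import requests

def list_gutenberg_books(lang="fr"):
    """Return a list of (ebook path, title) tuples from the Gutenberg language page."""
    url = f"https://www.gutenberg.org/browse/languages/{lang}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # Same pattern as in collect_words: capture the ebook path and the link text (title):
    return re.findall(r"href=\"(/ebooks/[0-9]+)\">(.+)", resp.text)

if __name__ == "__main__":
    books = list_gutenberg_books("fr")
    print(f"Found {len(books)} book links")
</code>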
===== Removing duplicate books and downloading =====

One step further now, with the following version of the **collect_words** method where I handle the duplicated books from the initial list (so we only have about 3400 books now, boooooo!! 😂), and then download a few of them into a local folder (this also uses ''import random'' at the top of the module):

<code python>
def collect_words(self):
    """Collecting words from text files."""
    # logger.info("Should collect text files here.")
    lang = self.get_param("language")

    # Get the text file links from gutenberg:
    url = f"https://www.gutenberg.org/browse/languages/{lang}"
    content = self.get_online_content(url)

    # Extract the urls:
    # https://www.gutenberg.org/ebooks/18812
    book_desc = re.findall(r"href=\"(/ebooks/[0-9]+)\">(.+)", content)

    book_urls = set()
    title_map = {}
    titles = set()

    for elem in book_desc:
        # logger.info("%s : '%s'", elem[0], elem[1])
        if elem[0] in book_urls:
            continue

        book_urls.add(elem[0])
        title = self.sanitize_title(elem[1])
        base_title = title
        idx = 1
        while title in titles:
            idx += 1
            title = f"{base_title}-{idx}"
            logger.info("Using alternate title: %s", title)

        titles.add(title)
        title_map[elem[0]] = title

    book_urls = list(book_urls)
    nbooks = len(book_urls)
    logger.info("Found %d books", nbooks)

    # download count
    max_num_books = self.get_param("num_books")
    count = 0
    if max_num_books == 0:
        max_num_books = nbooks

    random.shuffle(book_urls)

    tools = self.get_component("tools")

    data_dir = self.get_data_dir()
    dest_dir = self.get_path(data_dir, f"input_{lang}")
    self.make_folder(dest_dir)

    for i in range(nbooks):
        # try to download the book
        book_id = book_urls[i]
        url = f"https://www.gutenberg.org{book_id}.txt.utf-8"

        title = self.sanitize_title(title_map[book_id])
        dest_file = self.get_path(dest_dir, f"{title}.txt")

        # File should not exist already:
        if self.file_exists(dest_file):
            count += 1
        elif tools.download_file(url, dest_file, f"{count}/{max_num_books} "):
            count += 1

        if count >= max_num_books:
            break

    return True
</code>

Still working just fine so far:

<code>
kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp collect-words
2022/04/25 08:31:05 [nvp.nvp_object] INFO: Sending request on https://www.gutenberg.org/browse/languages/fr...
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: lettres-de-mon-moulin-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: la-tulipe-noire-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: les-mille-et-une-nuits-tome-premier-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: emaux-et-camees-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: lodyssee-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: lholocauste-roman-contemporain-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: lexilee-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: la-fille-du-capitaine-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: les-petites-filles-modeles-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: les-voyages-de-gulliver-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: autour-de-la-lune-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: le-tour-du-monde-en-quatre-vingts-jours-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: le-tour-du-monde-en-quatre-vingts-jours-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: le-tour-du-monde-en-quatre-vingts-jours-3
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: le-tour-du-monde-en-quatre-vingts-jours-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: le-tour-du-monde-en-quatre-vingts-jours-3
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: le-tour-du-monde-en-quatre-vingts-jours-4
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: les-tribulations-dun-chinois-en-chine-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: candide-ou-loptimisme-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: salome-2
2022/04/25 08:31:06 [components.password_generator] INFO: Found 3440 books
2022/04/25 08:31:06 [nvp.components.tools] INFO: Downloading file from https://www.gutenberg.org/ebooks/38335.txt.utf-8...
0/5 [==================================================] 6396/6396 100.000%
2022/04/25 08:31:07 [nvp.components.tools] INFO: Downloading file from https://www.gutenberg.org/ebooks/42036.txt.utf-8...
1/5 [==================================================] 6396/6396 100.000%
2022/04/25 08:31:07 [nvp.components.tools] INFO: Downloading file from https://www.gutenberg.org/ebooks/29282.txt.utf-8...
2/5 [==================================================] 283265/283265 100.000%
2022/04/25 08:31:09 [nvp.components.tools] INFO: Downloading file from https://www.gutenberg.org/ebooks/30788.txt.utf-8...
3/5 [==================================================] 1224157/1224157 100.000%
2022/04/25 08:31:13 [nvp.components.tools] INFO: Downloading file from https://www.gutenberg.org/ebooks/46541.txt.utf-8...
4/5 [==================================================] 6396/6396 100.000%
</code>
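Note that ''sanitize_title()'' relies on ''self.slugify()'', which comes from the NVP base tooling and isn't shown in this post. If you need a stand-in for it, a minimal sketch could look like this (a simplified sketch, not exactly what the NVP helper does):

<code python>
import re
import unicodedata

def slugify(title: str) -> str:
    """Small slug helper: strip accents, lowercase, keep only [a-z0-9] and dashes."""
    # Decompose accented characters and drop the combining marks:
    norm = unicodedata.normalize("NFKD", title)
    ascii_title = norm.encode("ascii", "ignore").decode("ascii")
    # Replace any run of other characters with a single dash:
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")

# e.g. slugify("Le Rêve") -> "le-reve"
</code>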
===== Handling invalid URLs and too-long titles =====

Then I realized that some of the downloaded text files actually contained HTML code indicating that the URL was not correct: apparently there are at least 2 different URL schemes to download the text files, so I added support for that, trying multiple URLs for each download. And I also eventually got this error:

<code>
2022/04/25 08:45:54 [nvp.components.tools] INFO: Downloading file from https://www.gutenberg.org/ebooks/57788.txt.utf-8...
Traceback (most recent call last):
  File "D:\Projects\NervProj\cli.py", line 5, in <module>
    ctx.run()
  File "D:\Projects\NervProj\nvp\nvp_context.py", line 291, in run
    if comp.process_command(cmd):
  File "D:\Projects\NervHome\components\password_generator.py", line 61, in process_command
    return self.collect_words()
  File "D:\Projects\NervHome\components\password_generator.py", line 134, in collect_words
    if tools.download_file(url, dest_file, f"{count}/{max_num_books} "):
  File "D:\Projects\NervProj\nvp\components\tools.py", line 258, in download_file
    with open(tmp_file, "wb") as fdd:
FileNotFoundError: [Errno 2] No such file or directory: 'D:\\Projects\\NervHome\\data\\words\\books_fr\\avis-pour-les-religieuses-de-lordre-de-lannonciade-celeste-fonde-a-genes-lannee-de-notre-salut-1604-brrimprimes-en-ladite-ville-amp-accomodes-a-la-pratique-de-lobservance-des-constitutions-pour-linstruction-des-exercices-spirituels-a-lusage-des-monasteres-du-meme-ordre.txt.download'
</code>

So obviously, I need to protect myself against way too long titles/filenames 😁!

===== Collecting the words =====

And then I could finally start collecting some words from those text files, ensuring I would only consider "valid" words for the selected language. This gives us the following code for **collect_words** (note that I moved the first part of our work above into a **download_books** method):

<code python>
def collect_words(self):
    """Collecting words from text files."""
    chars = {
        "fr": "abcdefghijklmnopqrstuvwxyzâàæçéêëèïîôûùüÿœ",
        "en": "abcdefghijklmnopqrstuvwxyz"
    }

    lang = self.get_param("language")
    data_dir = self.get_data_dir()
    dest_dir = self.get_path(data_dir, f"books_{lang}")

    # Now we get the content of each book:
    books = self.get_all_files(dest_dir, exp="\.txt")
    if len(books) == 0:
        logger.info("No book downloaded yet.")
        return True

    words = set()
    allowed_chars = chars[lang]

    for book in books:
        # logger.info("Should read book: %s", book)
        content = self.read_text_file(self.get_path(dest_dir, book))
        all_words = content.split(" ")
        logger.info("Processing %d words book %s...", len(all_words), book)

        added = 0
        # process each word:
        for word in all_words:
            word = self.sanitize_word(word, allowed_chars)
            if word is not None and word not in words:
                words.add(word)
                added += 1

        logger.info("Added %d new words", added)

    # write the list of words:
    dest_file = self.get_path(data_dir, f"words_{lang}.txt")
    words = list(words)
    words.sort()
    logger.info("Writting %d words for language %s", len(words), lang)
    self.write_text_file("\n".join(words), dest_file)

    return True

def sanitize_word(self, word, allowed):
    """Sanitize a given word for our current language of interest."""
    if not word.isalpha():
        return None

    word = word.lower()
    for char in word:
        if char not in allowed:
            return None

    return word.lower()
</code>
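One thing worth noting: splitting the content on plain spaces means that words glued to punctuation or line breaks fail the ''isalpha()'' check in ''sanitize_word'' and get discarded. That's fine for this quick project, but a possible refinement would be to tokenize with a regex instead, for instance (just a sketch of the alternative, not what the component above does):

<code python>
import re

def extract_words(content: str) -> list:
    """Split a text into candidate words, ignoring punctuation and line breaks."""
    # In Python 3, \w also matches accented letters, plus digits and underscores;
    # the sanitize_word() step still filters those out afterwards.
    return re.findall(r"\w+", content)

# e.g. extract_words("Bonjour, le monde!\nFin.") -> ['Bonjour', 'le', 'monde', 'Fin']
</code>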
And with that, I already collect about 61k unique words from 50 French books:

<code>
2022/04/25 09:24:57 [components.password_generator] INFO: Processing 60477 words book suzanne-et-le-pacifique.txt...
2022/04/25 09:24:57 [components.password_generator] INFO: Added 828 new words
2022/04/25 09:24:57 [components.password_generator] INFO: Processing 54861 words book suzanne-normis-roman-dun-pere.txt...
2022/04/25 09:24:57 [components.password_generator] INFO: Added 229 new words
2022/04/25 09:24:57 [components.password_generator] INFO: Processing 69463 words book un-hollandais-a-paris-en-1891-sensations-de-litterature-et-dart.txt...
2022/04/25 09:24:57 [components.password_generator] INFO: Added 407 new words
2022/04/25 09:24:57 [components.password_generator] INFO: Processing 111405 words book vercingetorix.txt...
2022/04/25 09:24:57 [components.password_generator] INFO: Added 1082 new words
2022/04/25 09:24:57 [components.password_generator] INFO: Processing 157230 words book vie-privee-et-publique-des-animaux.txt...
2022/04/25 09:24:58 [components.password_generator] INFO: Added 1079 new words
2022/04/25 09:24:58 [components.password_generator] INFO: Writting 61269 words for language fr
</code>

Which is certainly not bad at all already, but just for the fun I'm going to add a few more books to the input list 😋

===== Remaining error on utf-8 =====

So when trying to download more books I eventually got this other error:

<code>
2022/04/25 09:42:39 [nvp.components.tools] INFO: Downloading file from https://www.gutenberg.org/54035/54035-0.txt...
216/500 [==================================================] 191553/191553 100.000%
Traceback (most recent call last):
  File "D:\Projects\NervProj\cli.py", line 5, in <module>
    ctx.run()
  File "D:\Projects\NervProj\nvp\nvp_context.py", line 291, in run
    if comp.process_command(cmd):
  File "D:\Projects\NervHome\components\password_generator.py", line 66, in process_command
    return self.download_books()
  File "D:\Projects\NervHome\components\password_generator.py", line 147, in download_books
    content = self.read_text_file(dest_file)
  File "D:\Projects\NervProj\nvp\nvp_object.py", line 242, in read_text_file
    content = file.read()
  File "D:\Projects\NervProj\tools\windows\python-3.10.1\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 172089: invalid start byte
</code>

=> I need to improve the robustness when downloading and then reading a text file:

<code python>
for url in urls:
    if tools.download_file(url, dest_file, f"{count+1}/{max_num_books} "):
        # Check if the content of the file is not html:
        content = None
        try:
            content = self.read_text_file(dest_file)
        except UnicodeDecodeError:
            logger.error("Invalid unicode character in %s", dest_file)

        if content is None or content.startswith("<"):
            # Not what we want, discard that file:
            logger.info("Invalid content at %s, discarding it.", url)
            self.remove_file(dest_file)
        else:
            break
</code>
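Here I simply discard the files that cannot be decoded as UTF-8, which is good enough given how many books are available. An alternative (just a sketch, not what I do above) would be to fall back to latin-1 when UTF-8 decoding fails, since for instance byte 0xa9 is simply '©' in that encoding:

<code python>
def read_text_file_lenient(path: str) -> str:
    """Try UTF-8 first, then fall back to latin-1 (which can decode any byte)."""
    with open(path, "rb") as fd:
        data = fd.read()
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")
</code>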
**OK**, and now processing 500 French books, I get a collection of 182k words 👍:

<code>
2022/04/25 10:16:19 [components.password_generator] INFO: Processing voyage-en-orient-volume-2-les-nuits-du-ramazan-de-paris-a-cythere-lorely.txt...
2022/04/25 10:16:20 [components.password_generator] INFO: => Added 180 new words from 168709 source elements
2022/04/25 10:16:20 [components.password_generator] INFO: Processing voyages-du-capitaine-lemuel-gulliver-en-divers-pays-eloignes-tome-i-de-iii.txt...
2022/04/25 10:16:20 [components.password_generator] INFO: => Added 444 new words from 50923 source elements
2022/04/25 10:16:20 [components.password_generator] INFO: Processing voyages-imaginaires-songes-visions-et-romans-cabalistiques-tome-35.txt...
2022/04/25 10:16:20 [components.password_generator] INFO: => Added 152 new words from 83807 source elements
2022/04/25 10:16:20 [components.password_generator] INFO: Processing vue-generale-de-lhistoire-politique-de-leurope.txt...
2022/04/25 10:16:20 [components.password_generator] INFO: => Added 36 new words from 34304 source elements
2022/04/25 10:16:20 [components.password_generator] INFO: Writting 182560 words for language fr
</code>

=> Many of those words are probably not really French in fact... they could be town or people names. So maybe I should just ignore words containing a capital letter: let's do that:

<code python>
def sanitize_word(self, word, allowed):
    """Sanitize a given word for our current language of interest."""
    if not word.isalpha():
        return None

    # Ignore word with capital letter:
    if any(ele.isupper() for ele in word):
        return None

    # Should not need to convert to lower case:
    word = word.lower()
    for char in word:
        if char not in allowed:
            return None

    return word.lower()
</code>

With that change we still get more than 147k words:

<code>
2022/04/25 10:21:43 [components.password_generator] INFO: Processing voyage-en-espagne.txt...
2022/04/25 10:21:43 [components.password_generator] INFO: => Added 176 new words from 114676 source elements
2022/04/25 10:21:43 [components.password_generator] INFO: Processing voyage-en-orient-volume-2-les-nuits-du-ramazan-de-paris-a-cythere-lorely.txt...
2022/04/25 10:21:43 [components.password_generator] INFO: => Added 109 new words from 168709 source elements
2022/04/25 10:21:43 [components.password_generator] INFO: Processing voyages-du-capitaine-lemuel-gulliver-en-divers-pays-eloignes-tome-i-de-iii.txt...
2022/04/25 10:21:44 [components.password_generator] INFO: => Added 396 new words from 50923 source elements
2022/04/25 10:21:44 [components.password_generator] INFO: Processing voyages-imaginaires-songes-visions-et-romans-cabalistiques-tome-35.txt...
2022/04/25 10:21:44 [components.password_generator] INFO: => Added 133 new words from 83807 source elements
2022/04/25 10:21:44 [components.password_generator] INFO: Processing vue-generale-de-lhistoire-politique-de-leurope.txt...
2022/04/25 10:21:44 [components.password_generator] INFO: => Added 29 new words from 34304 source elements
2022/04/25 10:21:44 [components.password_generator] INFO: Writting 147523 words for language fr
</code>

=> Let's try to use that now anyway ;-)!

===== Generating the password =====

Finally, here is the method implementation to generate the actual multi-word password:

<code python>
def gen_password(self):
    """Generate a password given some input settings"""
    # logger.info("Should generate password here.")

    # Read the word file:
    lang = self.get_param("language")
    data_dir = self.get_data_dir()
    dest_file = self.get_path(data_dir, f"words_{lang}.txt")

    all_words = self.read_text_file(dest_file).splitlines()

    minl = self.get_param("min_len")
    maxl = self.get_param("max_len")
    kept_words = [word for word in all_words if len(word) >= minl and len(word) <= maxl]
    logger.info("Keeping %d / %d words", len(kept_words), len(all_words))

    num = self.get_param("num_words")

    # pick the random words:
    words = np.random.choice(kept_words, size=num)

    no_space = self.get_param("no_space")
    spacer = "" if no_space else " "
    password = spacer.join(words)
    logger.info("Generated password: \"%s\"", password)

    return True
</code>
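A small caveat worth mentioning here: ''numpy.random.choice()'' is not a cryptographically secure source of randomness, so for a password that really matters it would be safer to pick the words with the standard ''secrets'' module instead, for instance (a sketch of the alternative, not what the code above uses):

<code python>
import secrets

def pick_words(kept_words, num):
    """Pick 'num' words using a cryptographically secure RNG."""
    return [secrets.choice(kept_words) for _ in range(num)]

# e.g.: password = " ".join(pick_words(kept_words, 5))
</code>

For this quick project though, numpy will do just fine.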
And this is working great! 🤪:

<code>
kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp gen-password
2022/04/25 10:24:50 [components.password_generator] INFO: Keeping 79137 / 147523 words
2022/04/25 10:24:50 [components.password_generator] INFO: Generated password: "surgeons parton jouaient quière issit"

kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp gen-password
2022/04/25 10:25:02 [components.password_generator] INFO: Keeping 79137 / 147523 words
2022/04/25 10:25:02 [components.password_generator] INFO: Generated password: "cabat voilette logeroit épaissit allicere"

kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp gen-password
2022/04/25 10:25:16 [components.password_generator] INFO: Keeping 79137 / 147523 words
2022/04/25 10:25:16 [components.password_generator] INFO: Generated password: "écrirait pleynte movent escria hérissé"
</code>

===== Final code =====

And for reference, here is the final code for the complete component, in case someone is interested:

<code python>
"""PasswordGenerator module"""
import logging
import re
import random
import numpy as np

from nvp.nvp_component import NVPComponent
from nvp.nvp_context import NVPContext

logger = logging.getLogger(__name__)


class PasswordGenerator(NVPComponent):
    """PasswordGenerator component class"""

    def __init__(self, ctx: NVPContext, proj=None):
        """Component constructor"""
        NVPComponent.__init__(self, ctx)

        # Store the project
        self.proj = proj
        self.data_dir = None

        desc = {
            "gen-password": None,
            "download-books": None,
            "collect-words": None
        }
        ctx.define_subparsers("main", desc)

        psr = ctx.get_parser('main.collect-words')
        psr.add_argument("-l", "--language", dest="language", type=str, default="fr",
                         help="language to collect the words")

        psr = ctx.get_parser('main.download-books')
        psr.add_argument("-l", "--language", dest="language", type=str, default="fr",
                         help="language to collect the words")
        psr.add_argument("-n", "--num", dest="num_books", type=int, default=500,
                         help="Number of books to collect to get the words")

        psr = ctx.get_parser('main.gen-password')
        psr.add_argument("-l", "--language", dest="language", type=str, default="fr",
                         help="Input language to use to collect the words")
        psr.add_argument("-n", "--num", dest="num_words", type=int, default=5,
                         help="Number of words to collect")
        psr.add_argument("--min", dest="min_len", type=int, default=3,
                         help="Minimum number of characters in the words")
        psr.add_argument("--max", dest="max_len", type=int, default=8,
                         help="Maximum number of characters in the words")
        psr.add_argument("-p", "--no-space", dest="no_space", action='store_true',
                         help="Remove the space between the words")

    def get_data_dir(self):
        """Get the data directory for the words"""
        if self.data_dir is None:
            self.data_dir = self.get_path(self.proj.get_root_dir(), "data", "words")
        return self.data_dir

    def process_command(self, cmd0):
        """Re-implementation of process_command"""
        if cmd0 == 'gen-password':
            return self.gen_password()
        if cmd0 == 'download-books':
            return self.download_books()
        if cmd0 == 'collect-words':
            return self.collect_words()

        return False

    def download_books(self):
        """Download the books"""
        lang = self.get_param("language")
        data_dir = self.get_data_dir()
        dest_dir = self.get_path(data_dir, f"books_{lang}")

        # logger.info("Should collect text files here.")

        # Get the text file links from gutenberg:
        url = f"https://www.gutenberg.org/browse/languages/{lang}"
        content = self.get_online_content(url)

        # Extract the urls:
        # https://www.gutenberg.org/ebooks/18812
        book_desc = re.findall(r"href=\"(/ebooks/[0-9]+)\">(.+)", content)

        book_urls = set()
        title_map = {}
        titles = set()

        for elem in book_desc:
            # logger.info("%s : '%s'", elem[0], elem[1])
            if elem[0] in book_urls:
                continue

            book_urls.add(elem[0])
            title = self.sanitize_title(elem[1])
            base_title = title
            idx = 1
            while title in titles:
                idx += 1
                title = f"{base_title}-{idx}"
                logger.info("Using alternate title: %s", title)

            titles.add(title)
            title_map[elem[0]] = title

        book_urls = list(book_urls)
        nbooks = len(book_urls)
        logger.info("Found %d books", nbooks)

        # download count
        max_num_books = self.get_param("num_books")
        count = 0
        if max_num_books == 0:
            max_num_books = nbooks

        random.shuffle(book_urls)

        tools = self.get_component("tools")
        self.make_folder(dest_dir)

        for i in range(nbooks):
            # try to download the book
            book_id = book_urls[i]
            url1 = f"https://www.gutenberg.org{book_id}.txt.utf-8"
            book_num = book_id[8:]  # discarding the /ebooks/ prefix
            url2 = f"https://www.gutenberg.org/{book_num}/{book_num}-0.txt"
            urls = [url1, url2]

            title = self.sanitize_title(title_map[book_id])
            dest_file = self.get_path(dest_dir, f"{title}.txt")

            # File should not exist already:
            if self.file_exists(dest_file):
                count += 1
            else:
                for url in urls:
                    if tools.download_file(url, dest_file, f"{count+1}/{max_num_books} "):
                        # Check if the content of the file is not html:
                        content = None
                        try:
                            content = self.read_text_file(dest_file)
                        except UnicodeDecodeError:
                            logger.error("Invalid unicode character in %s", dest_file)

                        if content is None or content.startswith("<"):
                            # Not what we want, discard that file:
                            logger.info("Invalid content at %s, discarding it.", url)
                            self.remove_file(dest_file)
                        else:
                            break

                if not self.file_exists(dest_file):
                    logger.info("Could not download book %s from known urls.", book_num)
                else:
                    count += 1

            if count >= max_num_books:
                break

        return True

    def sanitize_title(self, title):
        """Replace all characters in title that should not be used in a file name"""
        slug = self.slugify(title)
        slug = slug[:100]
        return slug

    def collect_words(self):
        """Collecting words from text files."""
        chars = {
            "fr": "abcdefghijklmnopqrstuvwxyzâàæçéêëèïîôûùüÿœ",
            "en": "abcdefghijklmnopqrstuvwxyz"
        }

        lang = self.get_param("language")
        data_dir = self.get_data_dir()
        dest_dir = self.get_path(data_dir, f"books_{lang}")

        # Now we get the content of each book:
        books = self.get_all_files(dest_dir, exp="\.txt")
        if len(books) == 0:
            logger.info("No book downloaded yet.")
            return True

        words = set()
        allowed_chars = chars[lang]

        for book in books:
            # logger.info("Should read book: %s", book)
            content = self.read_text_file(self.get_path(dest_dir, book))
            all_words = content.split(" ")
            logger.info("Processing %s...", book)

            added = 0
            # process each word:
            for word in all_words:
                word = self.sanitize_word(word, allowed_chars)
                if word is not None and word not in words:
                    words.add(word)
                    added += 1

            logger.info("=> Added %d new words from %d source elements", added, len(all_words))

        # write the list of words:
        dest_file = self.get_path(data_dir, f"words_{lang}.txt")
        words = list(words)
        words.sort()
        logger.info("Writting %d words for language %s", len(words), lang)
        self.write_text_file("\n".join(words), dest_file)

        return True

    def sanitize_word(self, word, allowed):
        """Sanitize a given word for our current language of interest."""
        if not word.isalpha():
            return None

        # Ignore word with capital letter:
        if any(ele.isupper() for ele in word):
            return None

        # Should not need to convert to lower case:
        word = word.lower()
        for char in word:
            if char not in allowed:
                return None

        return word.lower()
    def gen_password(self):
        """Generate a password given some input settings"""
        # logger.info("Should generate password here.")

        # Read the word file:
        lang = self.get_param("language")
        data_dir = self.get_data_dir()
        dest_file = self.get_path(data_dir, f"words_{lang}.txt")

        all_words = self.read_text_file(dest_file).splitlines()

        minl = self.get_param("min_len")
        maxl = self.get_param("max_len")
        kept_words = [word for word in all_words if len(word) >= minl and len(word) <= maxl]
        logger.info("Keeping %d / %d words", len(kept_words), len(all_words))

        num = self.get_param("num_words")

        # pick the random words:
        words = np.random.choice(kept_words, size=num)

        no_space = self.get_param("no_space")
        spacer = "" if no_space else " "
        password = spacer.join(words)
        logger.info("Generated password: \"%s\"", password)

        return True
</code>

And this is it: in the end, some of the words you get almost seem like they are from a foreign language, but hey, I think that's good enough for a first version: I can still request more words than what I need and only pick the ones I know, right 😅? Or maybe I could simply start by ignoring words with accents for instance 🤔, let's see... only 105664 words are found this way, and indeed, the results seem a bit closer to "usable" French:

<code>
kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp gen-password
2022/04/25 10:31:17 [components.password_generator] INFO: Keeping 58496 / 105664 words
2022/04/25 10:31:17 [components.password_generator] INFO: Generated password: "estoyent hors crussiez apparatu desquex"

kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp gen-password
2022/04/25 10:31:25 [components.password_generator] INFO: Keeping 58496 / 105664 words
2022/04/25 10:31:25 [components.password_generator] INFO: Generated password: "audias mouettes dorsaux piqueur chicaner"

kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp gen-password
2022/04/25 10:31:32 [components.password_generator] INFO: Keeping 58496 / 105664 words
2022/04/25 10:31:32 [components.password_generator] INFO: Generated password: "amorties ramez forgeaf mesamer partire"
</code>

=> **Anyway**, I'm now done with this "quick project" ;-) it took me a couple of hours, which is too long already 😁! See yaa ✌

**Update**: I just realized that for English books, we cannot retrieve the list of books from a single page: instead we would have to go through the author listings spread over multiple pages and only select the books with "(English)" after the title. That would be a somewhat non-trivial task, so I won't do it right now.
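For reference, I didn't show the accent-filtering change above; a simple way to do it is to restrict the kept words to plain a-z letters when filtering on length, something along these lines (a sketch, not necessarily the exact change I made):

<code python>
def is_plain_ascii(word: str) -> bool:
    """Keep only words made of unaccented a-z letters."""
    return all("a" <= char <= "z" for char in word)

# Applied on top of the length filter in gen_password():
# kept_words = [w for w in kept_words if is_plain_ascii(w)]
</code>

And as a rough strength estimate, assuming the words are drawn uniformly and independently: with about 58k candidate words, each word contributes roughly log2(58000) ≈ 15.8 bits, so 5 words still give on the order of 79 bits of entropy, which is plenty for this usage (only keeping the words you actually "know" from a larger sample would of course lower that somewhat).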