
Quick project: generating multi-word passwords [python]

Hi guys! If you have already touched the crypto/blockchain field, then you know that the passwords generated to protect your wallets are very often built as a “collection of words” (English words are the only thing I've seen so far myself). For instance: “sleep score my opposite migrate”, really just random words.

I think this concept is quite interesting as a generic mechanism to create passwords: they are still very strong, but it's easier to remember/use a few words than a completely random collection of letters/digits/symbols 😊, especially when those words are in your native language (so French for me).

⇒ So today, let's build a python utility to do just that: generate this kind of multi-word password.

The way I see it, I could use the following command line arguments for that tool:

  • language: A language code specifying the language the words are taken from: could be English or French for instance, and should probably default to French in my case.
  • num_words: Number of words to pick
  • min_len: Minimum number of characters in the words, default to 3
  • max_len: Maximum number of characters in the words, default to 8
  • no_space: Remove the space between the words, should default to false

And that should be enough to generate our passwords with a command line such as:

nvp gen-password
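
Outside of the NVP framework, the same options could be sketched with plain argparse. This is only an illustration: the build_parser name is hypothetical, and the defaults simply mirror the ones listed above (with 5 words as the default count, as used later in this post).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a standalone parser mirroring the options listed above."""
    parser = argparse.ArgumentParser(prog="gen-password")
    parser.add_argument("-l", "--language", dest="language", type=str, default="fr",
                        help="Language code of the word list to use")
    parser.add_argument("-n", "--num", dest="num_words", type=int, default=5,
                        help="Number of words to pick")
    parser.add_argument("--min", dest="min_len", type=int, default=3,
                        help="Minimum number of characters in the words")
    parser.add_argument("--max", dest="max_len", type=int, default=8,
                        help="Maximum number of characters in the words")
    parser.add_argument("-p", "--no-space", dest="no_space", action="store_true",
                        help="Remove the space between the words")
    return parser


# Parse an explicit argument list instead of sys.argv, just for the demo:
args = build_parser().parse_args(["-n", "4", "--no-space"])
print(args.language, args.num_words, args.min_len, args.max_len, args.no_space)
```

Passing an explicit list to parse_args() keeps the sketch easy to try from an interactive session.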

Let's prepare the skeleton as a new NVPComponent:

"""PasswordGenerator module"""
import logging

from nvp.nvp_component import NVPComponent
from nvp.nvp_context import NVPContext

logger = logging.getLogger(__name__)


class PasswordGenerator(NVPComponent):
    """PasswordGenerator component class"""

    def __init__(self, ctx: NVPContext, _proj=None):
        """Component constructor"""
        NVPComponent.__init__(self, ctx)

        desc = {
            "gen-password": None,
        }
        ctx.define_subparsers("main", desc)

        psr = ctx.get_parser('main.gen-password')
        psr.add_argument("-l", "--language", dest="language", type=str, default="fr",
                         help="Input language to use to collect the words")
        psr.add_argument("-n", "--num", dest="num_words", type=int, default=5,
                         help="Number of words to collect")
        psr.add_argument("--min", dest="min_len", type=int, default=3,
                         help="Minimum number of characters in the words")
        psr.add_argument("--max", dest="max_len", type=int, default=8,
                         help="Maximum number of characters in the words")
        psr.add_argument("-p", "--no-space", dest="no_space", action='store_true',
                         help="Remove the space between the words")

    def process_command(self, cmd0):
        """Re-implementation of process_command"""

        if cmd0 == 'gen-password':
            return self.run()

        return False

    def run(self):
        """Generate a password given some input settings"""

        logger.info("Should generate password here.")
        return True

And this will give us the expected output when the command is run:

kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp gen-password
2022/04/25 07:22:44 [components.password_generator] INFO: Should generate password here.

Okay, now that we can run the command, let's start thinking about the actual process:

  • We will need to randomly select words from a collection of words. This should be easy to do with numpy.random.choice()
  • But first, we need a list of words to work with! And that, I don't have yet… So this requires some additional thinking/searching.
  • Once we have a list of words in the target language we can easily pre-process that list to consider only the words with the correct number of characters, and that part is easy too.
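
A side note on the random selection step: numpy.random.choice() relies on NumPy's general-purpose pseudo-random generator, which is not meant for security-sensitive use. Since we are producing passwords, one alternative is the standard secrets module. Here is a minimal sketch, assuming a pick_words helper and a tiny demo word list that are both made up for illustration:

```python
import secrets


def pick_words(candidates, num_words, separator=" "):
    """Pick num_words entries (with replacement) using the OS-backed secure generator."""
    if not candidates:
        raise ValueError("The candidate word list is empty")
    return separator.join(secrets.choice(candidates) for _ in range(num_words))


# Tiny hypothetical word list, only for illustration:
demo_words = ["voilette", "mouettes", "piqueur", "chicaner", "hors"]
print(pick_words(demo_words, 5))
print(pick_words(demo_words, 5, separator=""))
```

Like numpy.random.choice() with its default settings, this draws with replacement, so the same word can show up twice in one password.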

So, let's find some words first 😊!

Okay, so, that was not too hard: https://www.gutenberg.org/browse/languages/fr, for instance, lets us download raw text versions of French books, so that is what I just looked at.

And in fact on that page, you have a very large list of links in the format https://www.gutenberg.org/ebooks/18812, and for each of those links, we can then download a text file at (for instance) https://www.gutenberg.org/ebooks/18812.txt.utf-8 (or maybe not?) ⇒ So let's try to grab all those links, and then some of the text files!

For that, we will add a new command called collect-words to our PasswordGenerator module:

    def collect_words(self):
        """Collecting words from text files."""

        # logger.info("Should collect text files here.")
        lang = self.get_param("language")

        # Get the text file links from gutenberg:
        url = f"https://www.gutenberg.org/browse/languages/{lang}"

        content = self.get_online_content(url)

        # Extract the urls:
        # https://www.gutenberg.org/ebooks/18812
        book_desc = re.findall(r"href=\"(/ebooks/[0-9]+)\">(.+)</a>", content)

        for elem in book_desc:
            logger.info("%s : '%s'", elem[0], elem[1])

        logger.info("Found %d books", len(book_desc))

        return True

And with that code, I can find more than 4000 French books, so that should be far more than enough 😅:

2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/8561 : 'Une page d'amour'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/34451 : 'Paris'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/8907 : 'Pot-Bouille'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/17533 : 'Le Rêve'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/34528 : 'Rome'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/17557 : 'Son Excellence Eugène Rougon'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/8563 : 'La Terre'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/7461 : 'Thérèse Raquin'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/6470 : 'Le Ventre de Paris'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/56808 : 'La vérité en marche: L'affaire Dreyfus'
2022/04/25 08:03:21 [components.password_generator] INFO: /ebooks/46447 : 'Ma confession'
2022/04/25 08:03:21 [components.password_generator] INFO: Found 4611 books
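
The link extraction can also be tried in isolation. The sketch below runs a similar regular expression on a small hand-written HTML snippet (the markup is an assumption, not copied from the real page). It uses a non-greedy title group, which is a small deviation from the pattern above, and keeps only the first occurrence of each link:

```python
import re

# Hypothetical snippet mimicking the listing page (not the real markup):
SAMPLE_HTML = """
<li><a href="/ebooks/8561">Une page d'amour</a> (French)</li>
<li><a href="/ebooks/34451">Paris</a> (French)</li>
<li><a href="/ebooks/8561">Une page d'amour</a> (French)</li>
"""

# Non-greedy title group so a match stops at the first closing </a>:
LINK_RE = re.compile(r'href="(/ebooks/[0-9]+)">(.+?)</a>')


def extract_books(html: str) -> dict:
    """Map each /ebooks/<id> path to its title, keeping the first occurrence."""
    books = {}
    for path, title in LINK_RE.findall(html):
        books.setdefault(path, title)
    return books


books = extract_books(SAMPLE_HTML)
for path, title in books.items():
    print(f"{path} : '{title}'")
print(f"Found {len(books)} books")
```

Using a dict keyed by the link path removes repeated entries for free, which matters because the same book can be listed more than once.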

⇒ For now let's just retrieve a few of those books, let's say 20, randomly selected.

In fact this is giving me another idea: since I can collect so much text data in French, this could be used as an input to a deep learning network to write texts in French automatically? ⇒ To be investigated further some day if I have the time for it.

One step further now with the following version of the collect_words method, where I handle the duplicated books from the initial list (so we only have about 3400 books now, boooooo!! 😂), and then download a few of them to a local folder:

    def collect_words(self):
        """Collecting words from text files."""

        # logger.info("Should collect text files here.")
        lang = self.get_param("language")

        # Get the text file links from gutenberg:
        url = f"https://www.gutenberg.org/browse/languages/{lang}"

        content = self.get_online_content(url)

        # Extract the urls:
        # https://www.gutenberg.org/ebooks/18812
        book_desc = re.findall(r"href=\"(/ebooks/[0-9]+)\">(.+)</a>", content)

        book_urls = set()
        title_map = {}
        titles = set()

        for elem in book_desc:
            # logger.info("%s : '%s'", elem[0], elem[1])
            if elem[0] in book_urls:
                continue

            book_urls.add(elem[0])
            title = self.sanitize_title(elem[1])
            base_title = title
            idx = 1
            while title in titles:
                idx += 1
                title = f"{base_title}-{idx}"
                logger.info("Using alternate title: %s", title)
            titles.add(title)
            title_map[elem[0]] = title

        book_urls = list(book_urls)
        nbooks = len(book_urls)
        logger.info("Found %d books", nbooks)

        # download count
        max_num_books = self.get_param("num_books")
        count = 0
        if max_num_books == 0:
            max_num_books = nbooks

        random.shuffle(book_urls)

        tools = self.get_component("tools")
        data_dir = self.get_data_dir()
        dest_dir = self.get_path(data_dir, f"input_{lang}")
        self.make_folder(dest_dir)

        for i in range(nbooks):
            # try to download the book
            book_id = book_urls[i]
            url = f"https://www.gutenberg.org{book_id}.txt.utf-8"
            title = self.sanitize_title(title_map[book_id])

            dest_file = self.get_path(dest_dir, f"{title}.txt")

            # File should not exist already:
            if self.file_exists(dest_file):
                count += 1

            elif tools.download_file(url, dest_file, f"{count}/{max_num_books} "):
                count += 1

            if count >= max_num_books:
                break

        return True

Still working just fine so far:

kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp collect-words
2022/04/25 08:31:05 [nvp.nvp_object] INFO: Sending request on https://www.gutenberg.org/browse/languages/fr...
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: lettres-de-mon-moulin-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: la-tulipe-noire-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: les-mille-et-une-nuits-tome-premier-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: emaux-et-camees-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: lodyssee-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: lholocauste-roman-contemporain-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: lexilee-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: la-fille-du-capitaine-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: les-petites-filles-modeles-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: les-voyages-de-gulliver-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: autour-de-la-lune-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: le-tour-du-monde-en-quatre-vingts-jours-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: le-tour-du-monde-en-quatre-vingts-jours-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: le-tour-du-monde-en-quatre-vingts-jours-3
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: le-tour-du-monde-en-quatre-vingts-jours-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: le-tour-du-monde-en-quatre-vingts-jours-3
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: le-tour-du-monde-en-quatre-vingts-jours-4
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: les-tribulations-dun-chinois-en-chine-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: candide-ou-loptimisme-2
2022/04/25 08:31:06 [components.password_generator] INFO: Using alternate title: salome-2
2022/04/25 08:31:06 [components.password_generator] INFO: Found 3440 books
2022/04/25 08:31:06 [nvp.components.tools] INFO: Downloading file from https://www.gutenberg.org/ebooks/38335.txt.utf-8...
0/5 [==================================================] 6396/6396 100.000%
2022/04/25 08:31:07 [nvp.components.tools] INFO: Downloading file from https://www.gutenberg.org/ebooks/42036.txt.utf-8...
1/5 [==================================================] 6396/6396 100.000%
2022/04/25 08:31:07 [nvp.components.tools] INFO: Downloading file from https://www.gutenberg.org/ebooks/29282.txt.utf-8...
2/5 [==================================================] 283265/283265 100.000%
2022/04/25 08:31:09 [nvp.components.tools] INFO: Downloading file from https://www.gutenberg.org/ebooks/30788.txt.utf-8...
3/5 [==================================================] 1224157/1224157 100.000%
2022/04/25 08:31:13 [nvp.components.tools] INFO: Downloading file from https://www.gutenberg.org/ebooks/46541.txt.utf-8...
4/5 [==================================================] 6396/6396 100.000%
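
The title de-duplication used above can be isolated in a small helper. This is a sketch with a hypothetical unique_title function. Note that in the method above the log call sits inside the while loop, which is why the same “-2” candidate shows up several times in the output: every attempted suffix gets logged, not just the final one.

```python
def unique_title(base: str, used: set) -> str:
    """Return base, or base-2, base-3, ... if already taken, and record the result."""
    title = base
    idx = 1
    while title in used:
        idx += 1
        title = f"{base}-{idx}"
    used.add(title)
    return title


used_titles = set()
names = [unique_title(t, used_titles) for t in ["salome", "candide", "salome", "salome"]]
print(names)  # ['salome', 'candide', 'salome-2', 'salome-3']
```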

Then I realized that some of the downloaded text files actually contained HTML code indicating that the URL was not correct: apparently there are at least 2 different URL schemes to download the text files, so I added support for that, trying multiple URLs for each download.

I also eventually got this error:

2022/04/25 08:45:54 [nvp.components.tools] INFO: Downloading file from https://www.gutenberg.org/ebooks/57788.txt.utf-8...
Traceback (most recent call last):
  File "D:\Projects\NervProj\cli.py", line 5, in <module>
    ctx.run()
  File "D:\Projects\NervProj\nvp\nvp_context.py", line 291, in run
    if comp.process_command(cmd):
  File "D:\Projects\NervHome\components\password_generator.py", line 61, in process_command
    return self.collect_words()
  File "D:\Projects\NervHome\components\password_generator.py", line 134, in collect_words
    if tools.download_file(url, dest_file, f"{count}/{max_num_books} "):
  File "D:\Projects\NervProj\nvp\components\tools.py", line 258, in download_file
    with open(tmp_file, "wb") as fdd:
FileNotFoundError: [Errno 2] No such file or directory: 'D:\\Projects\\NervHome\\data\\words\\books_fr\\avis-pour-les-religieuses-de-lordre-de-lannonciade-celeste-fonde-a-genes-lannee-de-notre-salut-1604-brrimprimes-en-ladite-ville-amp-accomodes-a-la-pratique-de-lobservance-des-constitutions-pour-linstruction-des-exercices-spirituels-a-lusage-des-monasteres-du-meme-ordre.txt.download'

So obviously, I need to protect myself against way too long titles/filenames 😁!
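
Here is roughly what that protection can look like as a standalone function. The slugify below is only a rough stand-in for the framework helper used in this project (its real behavior is an assumption on my side), and the 100-character cut-off matches the one used in the final code of this post:

```python
import re
import unicodedata

MAX_TITLE_LEN = 100  # cut-off used by sanitize_title in the final code


def slugify(text: str) -> str:
    """Rough stand-in for the framework's slugify helper (assumed behavior)."""
    # Drop accents: decompose the characters, then keep only the ASCII part.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove anything that is not a letter, digit, whitespace or hyphen.
    text = re.sub(r"[^a-z0-9\s-]", "", text.lower())
    # Collapse runs of whitespace/hyphens into a single hyphen.
    return re.sub(r"[\s-]+", "-", text).strip("-")


def sanitize_title(title: str, max_len: int = MAX_TITLE_LEN) -> str:
    """Build a file-name-safe slug and cap its length."""
    return slugify(title)[:max_len].rstrip("-")


print(sanitize_title("L'Odyssée"))
print(len(sanitize_title("avis pour les religieuses " * 20)))
```

One caveat with truncation: two different long titles can end up with the same 100 first characters, so the duplicate-title handling still has to run on the truncated result.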

And then I could finally start collecting some words from those text files, ensuring I would only consider “valid” words for the selected language. Which gives us the following code for collect_words (note that I moved the first part of our work above into the download_books method):

    def collect_words(self):
        """Collecting words from text files."""

        chars = {
            "fr": "abcdefghijklmnopqrstuvwxyzâàæçéêëèïîôûùüÿœ",
            "en": "abcdefghijklmnopqrstuvwxyz"
        }

        lang = self.get_param("language")

        data_dir = self.get_data_dir()
        dest_dir = self.get_path(data_dir, f"books_{lang}")

        # Now we get the content of each book:
        books = self.get_all_files(dest_dir, exp=r"\.txt")
        if len(books) == 0:
            logger.info("No book downloaded yet.")
            return True

        words = set()

        allowed_chars = chars[lang]

        for book in books:
            # logger.info("Should read book: %s", book)
            content = self.read_text_file(self.get_path(dest_dir, book))
            all_words = content.split(" ")
            logger.info("Processing %d words book %s...", len(all_words), book)

            added = 0
            # process each word:
            for word in all_words:
                word = self.sanitize_word(word, allowed_chars)
                if word is not None and word not in words:
                    words.add(word)
                    added += 1

            logger.info("Added %d new words", added)

        # write the list of words:
        dest_file = self.get_path(data_dir, f"words_{lang}.txt")
        words = list(words)
        words.sort()

        logger.info("Writting %d words for language %s", len(words), lang)
        self.write_text_file("\n".join(words), dest_file)

        return True

    def sanitize_word(self, word, allowed):
        """Sanitize a given word for our current language of interest."""
        if not word.isalpha():
            return None

        word = word.lower()
        for char in word:
            if char not in allowed:
                return None

        return word

And with that, I already collect about 61k unique words from 50 French books:

2022/04/25 09:24:57 [components.password_generator] INFO: Processing 60477 words book suzanne-et-le-pacifique.txt...
2022/04/25 09:24:57 [components.password_generator] INFO: Added 828 new words
2022/04/25 09:24:57 [components.password_generator] INFO: Processing 54861 words book suzanne-normis-roman-dun-pere.txt...
2022/04/25 09:24:57 [components.password_generator] INFO: Added 229 new words
2022/04/25 09:24:57 [components.password_generator] INFO: Processing 69463 words book un-hollandais-a-paris-en-1891-sensations-de-litterature-et-dart.txt...
2022/04/25 09:24:57 [components.password_generator] INFO: Added 407 new words
2022/04/25 09:24:57 [components.password_generator] INFO: Processing 111405 words book vercingetorix.txt...
2022/04/25 09:24:57 [components.password_generator] INFO: Added 1082 new words
2022/04/25 09:24:57 [components.password_generator] INFO: Processing 157230 words book vie-privee-et-publique-des-animaux.txt...
2022/04/25 09:24:58 [components.password_generator] INFO: Added 1079 new words
2022/04/25 09:24:58 [components.password_generator] INFO: Writting 61269 words for language fr
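
One thing worth noting about the code above: content.split(" ") only splits on the space character, so a token that straddles a line break (or carries punctuation) fails the isalpha() test and gets discarded. A possible variant is sketched below, with a hypothetical extract_words helper that splits on any whitespace and trims punctuation first. The exact set of characters to trim is an assumption:

```python
import string

FRENCH_CHARS = set("abcdefghijklmnopqrstuvwxyzâàæçéêëèïîôûùüÿœ")
# Punctuation trimmed from both ends of a token (the non-ASCII extras are an assumption):
TRIM_CHARS = string.punctuation + "«»“”’…"


def extract_words(text: str, allowed: set) -> set:
    """Collect the lowercase words of text that only use allowed characters."""
    words = set()
    # str.split() without argument splits on any whitespace run, newlines included.
    for token in text.split():
        word = token.strip(TRIM_CHARS).lower()
        if word and all(char in allowed for char in word):
            words.add(word)
    return words


sample = "Le chat dort,\nle chien aboie."
print(sample.split(" "))  # 'dort,\nle' stays glued together here
print(sorted(extract_words(sample, FRENCH_CHARS)))
```

Words with an inner apostrophe (like “l'amour”) are still rejected by this sketch, since the apostrophe is not part of the allowed alphabet.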

61k words is certainly not bad already, but just for fun I'm going to add a few more books to the input list 😋

So when trying to download more books I eventually got this other error:

2022/04/25 09:42:39 [nvp.components.tools] INFO: Downloading file from https://www.gutenberg.org/54035/54035-0.txt...
216/500 [==================================================] 191553/191553 100.000%
Traceback (most recent call last):
  File "D:\Projects\NervProj\cli.py", line 5, in <module>
    ctx.run()
  File "D:\Projects\NervProj\nvp\nvp_context.py", line 291, in run
    if comp.process_command(cmd):
  File "D:\Projects\NervHome\components\password_generator.py", line 66, in process_command
    return self.download_books()
  File "D:\Projects\NervHome\components\password_generator.py", line 147, in download_books
    content = self.read_text_file(dest_file)
  File "D:\Projects\NervProj\nvp\nvp_object.py", line 242, in read_text_file
    content = file.read()
  File "D:\Projects\NervProj\tools\windows\python-3.10.1\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 172089: invalid start byte

⇒ I need to improve the robustness when downloading then reading a text file:

                for url in urls:
                    if tools.download_file(url, dest_file, f"{count+1}/{max_num_books} "):
                        # Check if the content of the file is not html:
                        content = None
                        try:
                            content = self.read_text_file(dest_file)
                        except UnicodeDecodeError:
                            logger.error("Invalid unicode character in %s", dest_file)

                        if content is None or content.startswith("<!DOCTYPE html>"):
                            # Not what we want, discard that file:
                            logger.info("Invalid content at %s, discarding it.", url)
                            self.remove_file(dest_file)
                        else:
                            break
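
Instead of discarding a file that is not valid UTF-8, another option would be to retry with a fallback encoding. The sketch below assumes Latin-1 as the fallback, which is only a guess about how those files are encoded (byte 0xa9 from the traceback would be the © sign in Latin-1). The read_text_with_fallback name is made up for the example:

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Return the content decoded with the first encoding that works, else None."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


# Small demo with a file that is valid Latin-1 but not valid UTF-8:
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_file = Path(tmp_dir) / "book.txt"
    sample_file.write_bytes(b"caf\xe9 \xa9")
    print(read_text_with_fallback(sample_file))
    print(read_text_with_fallback(sample_file, encodings=("utf-8",)))
```

Latin-1 can decode any byte sequence, so with this fallback the function never returns None. A wrong guess just produces odd characters, which the word sanitizing step would then reject anyway.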

OK, and now, processing 500 French books, I get a collection of 182k words 👍:

2022/04/25 10:16:19 [components.password_generator] INFO: Processing  voyage-en-orient-volume-2-les-nuits-du-ramazan-de-paris-a-cythere-lorely.txt...
2022/04/25 10:16:20 [components.password_generator] INFO: => Added 180 new words from 168709 source elements
2022/04/25 10:16:20 [components.password_generator] INFO: Processing  voyages-du-capitaine-lemuel-gulliver-en-divers-pays-eloignes-tome-i-de-iii.txt...
2022/04/25 10:16:20 [components.password_generator] INFO: => Added 444 new words from 50923 source elements
2022/04/25 10:16:20 [components.password_generator] INFO: Processing  voyages-imaginaires-songes-visions-et-romans-cabalistiques-tome-35.txt...
2022/04/25 10:16:20 [components.password_generator] INFO: => Added 152 new words from 83807 source elements
2022/04/25 10:16:20 [components.password_generator] INFO: Processing  vue-generale-de-lhistoire-politique-de-leurope.txt...
2022/04/25 10:16:20 [components.password_generator] INFO: => Added 36 new words from 34304 source elements
2022/04/25 10:16:20 [components.password_generator] INFO: Writting 182560 words for language fr

⇒ Many of those words are probably not French in fact… they could be town or people names. So maybe I should ignore words containing a capital letter. Let's do that:

    def sanitize_word(self, word, allowed):
        """Sanitize a given word for our current language of interest."""
        if not word.isalpha():
            return None

        # Ignore word with capital letter:
        if any(ele.isupper() for ele in word):
            return None

        # No need to convert to lower case: words with capitals were rejected above.
        for char in word:
            if char not in allowed:
                return None

        return word

With that change we still get more than 147k words:

2022/04/25 10:21:43 [components.password_generator] INFO: Processing  voyage-en-espagne.txt...
2022/04/25 10:21:43 [components.password_generator] INFO: => Added 176 new words from 114676 source elements
2022/04/25 10:21:43 [components.password_generator] INFO: Processing  voyage-en-orient-volume-2-les-nuits-du-ramazan-de-paris-a-cythere-lorely.txt...
2022/04/25 10:21:43 [components.password_generator] INFO: => Added 109 new words from 168709 source elements
2022/04/25 10:21:43 [components.password_generator] INFO: Processing  voyages-du-capitaine-lemuel-gulliver-en-divers-pays-eloignes-tome-i-de-iii.txt...
2022/04/25 10:21:44 [components.password_generator] INFO: => Added 396 new words from 50923 source elements
2022/04/25 10:21:44 [components.password_generator] INFO: Processing  voyages-imaginaires-songes-visions-et-romans-cabalistiques-tome-35.txt...
2022/04/25 10:21:44 [components.password_generator] INFO: => Added 133 new words from 83807 source elements
2022/04/25 10:21:44 [components.password_generator] INFO: Processing  vue-generale-de-lhistoire-politique-de-leurope.txt...
2022/04/25 10:21:44 [components.password_generator] INFO: => Added 29 new words from 34304 source elements
2022/04/25 10:21:44 [components.password_generator] INFO: Writting 147523 words for language fr

⇒ Let's try to use that now anyway ;-)!

Finally, here is the method implementation to generate the actual multi-word password:

    def gen_password(self):
        """Generate a password given some input settings"""

        # logger.info("Should generate password here.")
        # Read the word file:
        lang = self.get_param("language")

        data_dir = self.get_data_dir()
        dest_file = self.get_path(data_dir, f"words_{lang}.txt")

        all_words = self.read_text_file(dest_file).splitlines()

        minl = self.get_param("min_len")
        maxl = self.get_param("max_len")

        kept_words = [word for word in all_words if minl <= len(word) <= maxl]
        logger.info("Keeping %d / %d words", len(kept_words), len(all_words))

        num = self.get_param("num_words")

        # pick the random words:
        words = np.random.choice(kept_words, size=num)

        no_space = self.get_param("no_space")

        spacer = "" if no_space else " "
        password = spacer.join(words)
        logger.info("Generated password: \"%s\"", password)

        return True

And this is working great! 🤪:

kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp gen-password
2022/04/25 10:24:50 [components.password_generator] INFO: Keeping 79137 / 147523 words
2022/04/25 10:24:50 [components.password_generator] INFO: Generated password: "surgeons parton jouaient quière issit"

kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp gen-password
2022/04/25 10:25:02 [components.password_generator] INFO: Keeping 79137 / 147523 words
2022/04/25 10:25:02 [components.password_generator] INFO: Generated password: "cabat voilette logeroit épaissit allicere"

kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp gen-password
2022/04/25 10:25:16 [components.password_generator] INFO: Keeping 79137 / 147523 words
2022/04/25 10:25:16 [components.password_generator] INFO: Generated password: "écrirait pleynte movent escria hérissé"
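
To get a feel for how strong such a password is, we can estimate its entropy from the numbers in the logs above. The sketch below assumes that each word is drawn uniformly and independently from the 79137 kept words, and that an attacker knows the word list. With 5 words this lands at roughly 81 bits.

```python
import math


def passphrase_entropy_bits(pool_size: int, num_words: int) -> float:
    """Entropy in bits when each word is drawn uniformly and independently."""
    return num_words * math.log2(pool_size)


# Pool size taken from the "Keeping 79137 / 147523 words" log line:
bits = passphrase_entropy_bits(79137, 5)
# Equivalent length of a password made of random printable ASCII characters (94 symbols):
equivalent_chars = bits / math.log2(94)

print(f"{bits:.1f} bits, about {equivalent_chars:.1f} random printable characters")
```

Keep in mind this is an upper bound for a human-filtered result: if I only keep the words I actually know, the effective pool is much smaller than 79137.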

And for reference, here is the final code for the complete component in case someone is interested:

"""PasswordGenerator module"""
import logging
import re
import random

import numpy as np

from nvp.nvp_component import NVPComponent
from nvp.nvp_context import NVPContext

logger = logging.getLogger(__name__)


class PasswordGenerator(NVPComponent):
    """PasswordGenerator component class"""

    def __init__(self, ctx: NVPContext, proj=None):
        """Component constructor"""
        NVPComponent.__init__(self, ctx)

        # Store the project
        self.proj = proj
        self.data_dir = None

        desc = {
            "gen-password": None,
            "download-books": None,
            "collect-words": None
        }
        ctx.define_subparsers("main", desc)

        psr = ctx.get_parser('main.collect-words')
        psr.add_argument("-l", "--language", dest="language", type=str, default="fr",
                         help="language to collect the words")

        psr = ctx.get_parser('main.download-books')
        psr.add_argument("-l", "--language", dest="language", type=str, default="fr",
                         help="language to collect the words")
        psr.add_argument("-n", "--num", dest="num_books", type=int, default=500,
                         help="Number of books to collect to get the words")

        psr = ctx.get_parser('main.gen-password')
        psr.add_argument("-l", "--language", dest="language", type=str, default="fr",
                         help="Input language to use to collect the words")
        psr.add_argument("-n", "--num", dest="num_words", type=int, default=5,
                         help="Number of words to collect")
        psr.add_argument("--min", dest="min_len", type=int, default=3,
                         help="Minimum number of characters in the words")
        psr.add_argument("--max", dest="max_len", type=int, default=8,
                         help="Maximum number of characters in the words")
        psr.add_argument("-p", "--no-space", dest="no_space", action='store_true',
                         help="Remove the space between the words")

    def get_data_dir(self):
        """Get the data directory for the words"""
        if self.data_dir is None:
            self.data_dir = self.get_path(self.proj.get_root_dir(), "data", "words")

        return self.data_dir

    def process_command(self, cmd0):
        """Re-implementation of process_command"""

        if cmd0 == 'gen-password':
            return self.gen_password()

        if cmd0 == 'download-books':
            return self.download_books()

        if cmd0 == 'collect-words':
            return self.collect_words()

        return False

    def download_books(self):
        """Download the books"""

        lang = self.get_param("language")

        data_dir = self.get_data_dir()
        dest_dir = self.get_path(data_dir, f"books_{lang}")

        # logger.info("Should collect text files here.")

        # Get the text file links from gutenberg:
        url = f"https://www.gutenberg.org/browse/languages/{lang}"

        content = self.get_online_content(url)

        # Extract the urls:
        # https://www.gutenberg.org/ebooks/18812
        book_desc = re.findall(r"href=\"(/ebooks/[0-9]+)\">(.+)</a>", content)

        book_urls = set()
        title_map = {}
        titles = set()

        for elem in book_desc:
            # logger.info("%s : '%s'", elem[0], elem[1])
            if elem[0] in book_urls:
                continue

            book_urls.add(elem[0])
            title = self.sanitize_title(elem[1])
            base_title = title
            idx = 1
            while title in titles:
                idx += 1
                title = f"{base_title}-{idx}"
                logger.info("Using alternate title: %s", title)
            titles.add(title)
            title_map[elem[0]] = title

        book_urls = list(book_urls)
        nbooks = len(book_urls)
        logger.info("Found %d books", nbooks)

        # download count
        max_num_books = self.get_param("num_books")
        count = 0
        if max_num_books == 0:
            max_num_books = nbooks

        random.shuffle(book_urls)

        tools = self.get_component("tools")

        self.make_folder(dest_dir)

        for i in range(nbooks):
            # try to download the book
            book_id = book_urls[i]
            url1 = f"https://www.gutenberg.org{book_id}.txt.utf-8"
            book_num = book_id[8:]  # discarding the /ebooks/ prefix
            url2 = f"https://www.gutenberg.org/{book_num}/{book_num}-0.txt"

            urls = [url1, url2]
            title = self.sanitize_title(title_map[book_id])

            dest_file = self.get_path(dest_dir, f"{title}.txt")

            # File should not exist already:
            if self.file_exists(dest_file):
                count += 1
            else:
                for url in urls:
                    if tools.download_file(url, dest_file, f"{count+1}/{max_num_books} "):
                        # Check if the content of the file is not html:
                        content = None
                        try:
                            content = self.read_text_file(dest_file)
                        except UnicodeDecodeError:
                            logger.error("Invalid unicode character in %s", dest_file)

                        if content is None or content.startswith("<!DOCTYPE html>"):
                            # Not what we want, discard that file:
                            logger.info("Invalid content at %s, discarding it.", url)
                            self.remove_file(dest_file)
                        else:
                            break

                if not self.file_exists(dest_file):
                    logger.info("Could not download book %s from known urls.", book_num)
                else:
                    count += 1

            if count >= max_num_books:
                break

        return True

    def sanitize_title(self, title):
        """Replace all characters in title that should not be used in file name"""
        slug = self.slugify(title)
        slug = slug[:100]
        return slug

    def collect_words(self):
        """Collecting words from text files."""

        chars = {
            "fr": "abcdefghijklmnopqrstuvwxyzâàæçéêëèïîôûùüÿœ",
            "en": "abcdefghijklmnopqrstuvwxyz"
        }

        lang = self.get_param("language")

        data_dir = self.get_data_dir()
        dest_dir = self.get_path(data_dir, f"books_{lang}")

        # Now we get the content of each book:
        books = self.get_all_files(dest_dir, exp=r"\.txt")
        if len(books) == 0:
            logger.info("No book downloaded yet.")
            return True

        words = set()

        allowed_chars = chars[lang]

        for book in books:
            # logger.info("Should read book: %s", book)
            content = self.read_text_file(self.get_path(dest_dir, book))
            all_words = content.split(" ")
            logger.info("Processing  %s...", book)

            added = 0
            # process each word:
            for word in all_words:
                word = self.sanitize_word(word, allowed_chars)
                if word is not None and word not in words:
                    words.add(word)
                    added += 1

            logger.info("=> Added %d new words from %d source elements", added, len(all_words))

        # write the list of words:
        dest_file = self.get_path(data_dir, f"words_{lang}.txt")
        words = list(words)
        words.sort()

        logger.info("Writting %d words for language %s", len(words), lang)
        self.write_text_file("\n".join(words), dest_file)

        return True

    def sanitize_word(self, word, allowed):
        """Sanitize a given word for our current language of interest."""
        if not word.isalpha():
            return None

        # Ignore word with capital letter:
        if any(ele.isupper() for ele in word):
            return None

        # No need to convert to lower case: words with capitals were rejected above.
        for char in word:
            if char not in allowed:
                return None

        return word

    def gen_password(self):
        """Generate a password given some input settings"""

        # logger.info("Should generate password here.")
        # Read the word file:
        lang = self.get_param("language")

        data_dir = self.get_data_dir()
        dest_file = self.get_path(data_dir, f"words_{lang}.txt")

        all_words = self.read_text_file(dest_file).splitlines()

        minl = self.get_param("min_len")
        maxl = self.get_param("max_len")

        kept_words = [word for word in all_words if minl <= len(word) <= maxl]
        logger.info("Keeping %d / %d words", len(kept_words), len(all_words))

        num = self.get_param("num_words")

        # pick the random words:
        words = np.random.choice(kept_words, size=num)

        no_space = self.get_param("no_space")

        spacer = "" if no_space else " "
        password = spacer.join(words)
        logger.info("Generated password: \"%s\"", password)

        return True

And this is it: in the end some of the words you get almost seem like they are from a foreign language, but hey, I think that's good enough for a first version: I can still request more words than I need and only pick the ones I know, right 😅?

Or maybe I could simply start by ignoring words with accents for instance 🤔, let's see… only 105664 words found this way, and indeed, the results seem a bit closer to “usable” French that way:

kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp gen-password
2022/04/25 10:31:17 [components.password_generator] INFO: Keeping 58496 / 105664 words
2022/04/25 10:31:17 [components.password_generator] INFO: Generated password: "estoyent hors crussiez apparatu desquex"

kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp gen-password
2022/04/25 10:31:25 [components.password_generator] INFO: Keeping 58496 / 105664 words
2022/04/25 10:31:25 [components.password_generator] INFO: Generated password: "audias mouettes dorsaux piqueur chicaner"

kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp gen-password
2022/04/25 10:31:32 [components.password_generator] INFO: Keeping 58496 / 105664 words
2022/04/25 10:31:32 [components.password_generator] INFO: Generated password: "amorties ramez forgeaf mesamer partire"
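
For reference, one possible way to get this accent-free variant is simply to restrict the allowed alphabet to plain a-z when sanitizing the words. The standalone sketch below follows the same logic as sanitize_word, but it is only an illustration and the candidate list is made up:

```python
import string

PLAIN_ALPHABET = set(string.ascii_lowercase)


def sanitize_word(word, allowed):
    """Return the word if it has no capital and only uses allowed characters, else None."""
    if not word.isalpha():
        return None
    # Ignore words containing a capital letter (likely names):
    if any(char.isupper() for char in word):
        return None
    if any(char not in allowed for char in word):
        return None
    return word


# Hypothetical candidates, just for the demo:
candidates = ["mouettes", "hérissé", "Paris", "piqueur", "épaissit"]
kept = [word for word in candidates if sanitize_word(word, PLAIN_ALPHABET) is not None]
print(kept)  # ['mouettes', 'piqueur']
```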

Anyway, I'm done with this “quick project” ;-) It took me a couple of hours, which is too long already 😁! See yaa ✌

Update: I just realized that for English books, we cannot retrieve the list of books from a single page: instead we would have to process multiple pages by author name, and only select the books with “(English)” after the title. That would be a somewhat non-trivial task, so I won't do that right now.

  • Last modified: 2022/04/25 11:42