
Quick project: generating multiword passwords - part 2 [python]

Hello world! So a few days ago, I started a “quick project” to generate “multiword passwords” with a simple Python utility. The most complex part of that tool was retrieving a dataset of unique words in a given language. That turned out to be “not that tricky” in French, but unfortunately the solution I implemented at the time does not work when trying to collect English words 😅.

⇒ So this is what we will address here, let's start 👍!

The main problem with English books on https://www.gutenberg.org is that there are too many of them: the page “https://www.gutenberg.org/browse/languages/en” will not display any book list, so instead we have to iterate over all the author pages to collect the English book names.

So let's see how we can implement that…

The typical entries we should be looking for will look like this:

<li class="pgdbetext"><a href="/ebooks/61217">1,492,633 Marlon Brandos</a> (English) (as Author)</li>
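Just to make this concrete, here is a tiny standalone sketch (not part of the tool itself) showing what the regex I use below will capture on such an entry:

    import re

    line = '<li class="pgdbetext"><a href="/ebooks/61217">1,492,633 Marlon Brandos</a> (English) (as Author)</li>'
    # Capture the ebook url and the title, only for English entries:
    matches = re.findall(r"href=\"(/ebooks/[0-9]+)\">(.+)</a> \(English\)", line)
    print(matches)
    # => [('/ebooks/61217', '1,492,633 Marlon Brandos')]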

And… well, here we go already: I simply moved the part of download_books() where I first retrieve the list of all books into a dedicated get_book_list() method, and added some customization to iterate over the sub-pages at “https://www.gutenberg.org/browse/authors/” when the language is “en”:

    def get_book_list(self, lang):
        """Retrieve our full list of books in this language"""

        book_urls = set()
        title_map = {}
        titles = set()

        def process_book_descs(descs):
            for elem in descs:
                # logger.info("%s : '%s'", elem[0], elem[1])
                if elem[0] in book_urls:
                    continue

                book_urls.add(elem[0])
                title = self.sanitize_title(elem[1])
                base_title = title
                idx = 1
                while title in titles:
                    idx += 1
                    title = f"{base_title}-{idx}"
                    logger.info("Using alternate title: %s", title)
                titles.add(title)
                title_map[elem[0]] = title

        if lang == "en":
            # Special handling for English: the language page is empty,
            # so we iterate over the per-author index pages instead.

            # list all the pages that should be visited:
            plist = ["other"] + list(string.ascii_lowercase)

            base_url = "https://www.gutenberg.org/browse/authors/"
            for suffix in plist:
                url = f"{base_url}{suffix}"
                # logger.info("Should collect english books from page: %s", url)

                content = self.get_online_content(url)
                book_desc = re.findall(r"href=\"(/ebooks/[0-9]+)\">(.+)</a> \(English\)", content)
                logger.info("Processing %d books from author page '%s'", len(book_desc), suffix.upper())

                process_book_descs(book_desc)

        else:
            # Get the ebook links from the Gutenberg language page:
            url = f"https://www.gutenberg.org/browse/languages/{lang}"

            content = self.get_online_content(url)

            # Extract the urls:
            # https://www.gutenberg.org/ebooks/18812
            book_desc = re.findall(r"href=\"(/ebooks/[0-9]+)\">(.+)</a>", content)

            process_book_descs(book_desc)

        return book_urls, title_map
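With that in place, download_books() only has to call this new method; the surrounding code is not shown in this post, so this is just a sketch of what the call site could look like:

        # Hypothetical call site inside download_books():
        book_urls, title_map = self.get_book_list(lang)
        for url in book_urls:
            title = title_map[url]
            # ... download the ebook at that url and save it under 'title' ...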

⇒ And now this seems to be working just fine already 😁 The download of 500 English books is currently in progress, let's see how it goes.

And this is giving me an idea: instead of just collecting all the “words” from all the books for a given language, I could rather assign a count statistic to each word, and then discard words that are very rarely used! ⇒ I'll try that as a next step.

So, as mentioned just above, I then thought it would be good to track the word counts, so I updated the collect_words() method with this content at the end, using a pandas DataFrame to write a CSV file:

        words = {}

        allowed_chars = chars[lang]

        for book in books:
            # logger.info("Should read book: %s", book)
            content = self.read_text_file(self.get_path(dest_dir, book))
            all_words = content.split(" ")
            logger.info("Processing  %s...", book)

            added = 0
            # process each word:
            for word in all_words:
                word = self.sanitize_word(word, allowed_chars)
                if word is None:
                    continue

                if word not in words:
                    words[word] = 1
                    added += 1
                else:
                    words[word] += 1

            logger.info("=> Added %d new words from %d source elements", added, len(all_words))

        # write the list of words:
        dset = pd.DataFrame(words.items(), columns=['Word', 'Count'])

        dest_file = self.get_path(data_dir, f"words_{lang}.csv")
        # words = list(words)
        # words.sort()
        dset.sort_values(by="Count", inplace=True, ascending=False)

        logger.info("Writting %d words for language %s", len(words), lang)
        # self.write_text_file("\n".join(words), dest_file)
        dset.to_csv(dest_file, index=False)

Then of course I had to update the gen_password() method accordingly to support reading that CSV file, and also to discard the words with a low count:

        data_dir = self.get_data_dir()
        dest_file = self.get_path(data_dir, f"words_{lang}.csv")

        # read_csv() already returns a DataFrame:
        dset = pd.read_csv(dest_file, keep_default_na=False, dtype={"Word": str, "Count": int})

        # indices = dset.index[dset.Word.isnull()]
        # logger.error("Invalid word indices: %s", indices)
        # indices = dset.index[dset.Count.isnull()]
        # logger.error("Invalid count indices: %s", indices)

        assert not dset.isnull().values.any(), f"Nan values detected in {dest_file}"

        logger.info("Loaded %d words", dset.shape[0])

        # filter the words that are rarely used:
        min_count = self.get_param("min_count", 0)

        mean = statistics.mean(dset.Count)
        sig = statistics.stdev(dset.Count)
        logger.info("Mean word count: %f, stddev: %f", mean, sig)

        idx = dset.Count >= min_count
        # all_words = dset[idx].Word.tolist()
        all_words = list(dset[idx].Word.values)
        logger.info("Keeping only %d words with min count threshold", len(all_words))

        # all_words = self.read_text_file(dest_file).splitlines()
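The excerpt above stops once all_words is ready; the actual sampling step is not shown in this post, but a minimal sketch of it could look like this (secrets gives cryptographically secure random choices, and the “num_words” parameter name is just an assumption of mine; the logs below also suggest an extra filtering pass, e.g. on word length, which I'm not reproducing here):

        # Sketch only (the real sampling code is not shown in this post).
        # 'secrets' (imported at module level) provides secure random
        # choices; "num_words" is a hypothetical parameter name:
        num_words = self.get_param("num_words", 5)
        password = " ".join(secrets.choice(all_words) for _ in range(num_words))
        logger.info('Generated password: "%s"', password)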

Note: In this process I actually ran into some trouble because of how pandas parses CSV files by default, and because I have the English word “null” in that list of words: this initially gave me a NaN value, of course. ⇒ The solution is to use keep_default_na=False when calling read_csv() ;-).
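Here is a small self-contained example reproducing the issue (again just a sketch, independent from the tool):

    import io

    import pandas as pd

    csv_data = "Word,Count\nnull,42\n"
    # By default, pandas treats "null" as a missing value:
    print(pd.read_csv(io.StringIO(csv_data)).Word[0])
    # => nan
    # With keep_default_na=False we get the actual string back:
    print(pd.read_csv(io.StringIO(csv_data), keep_default_na=False).Word[0])
    # => null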

⇒ And this is it: now I can generate my multiword passwords in English too:

kenshin@Saturn /cygdrive/d/Projects/NervHome
$ nvp gen-password -l en
2022/04/29 13:45:37 [components.password_generator] INFO: Loaded 83352 words
2022/04/29 13:45:37 [components.password_generator] INFO: Mean word count: 251.541247, stddev: 9010.520064
2022/04/29 13:45:37 [components.password_generator] INFO: Keeping only 28248 words with min count threshold
2022/04/29 13:45:37 [components.password_generator] INFO: Keeping 18541 / 28248 words
2022/04/29 13:45:37 [components.password_generator] INFO: Generated password: "catering foulest asset elicit scholars"

$ nvp gen-password -l en -n 20
2022/04/29 13:46:35 [components.password_generator] INFO: Loaded 83352 words
2022/04/29 13:46:36 [components.password_generator] INFO: Mean word count: 251.541247, stddev: 9010.520064
2022/04/29 13:46:36 [components.password_generator] INFO: Keeping only 28248 words with min count threshold
2022/04/29 13:46:36 [components.password_generator] INFO: Keeping 18541 / 28248 words
2022/04/29 13:46:36 [components.password_generator] INFO: Generated password: "bandage simple coming ordering cord framed heretics gnarled excreted yell
reacted conjure debates craven lend stamps dainty moisten leaving damsel"
