ProbableOdyssey

Small project workflow: The making of `foxtail`

Something I’ve wanted to do for a while is build a tool that makes it easier to write content for this blog. I often save bookmarks while browsing the internet, each time making a mental note to come back to that link to revisit and review it, and perhaps even write about it (a mental note that is doomed to be forgotten in the busyness of the day).

So I thought about how I could build a tool to make that process easier. I’ve got much more experience than I used to, so I challenged myself to cobble together a tool for this in a matter of hours (and also to post about it, because perhaps there’s something to learn from writing/reading this).

The problem statement

A tool like this would make it easier to write “newsletter” type posts. Often when I feel like writing – especially about topics I’ve read about online – I get discouraged because I feel like I won’t be able to avoid repeating points that were much more eloquently stated in the source material.

On the other hand, I think there’s a bit of value in taking the time to write summaries or lists of “stuff I found interesting this week”. It may not be read by anyone, but it would be a more detailed way to keep track of good content over time.

I use Firefox on all my devices, and the bookmark syncing is great. However, I’m not a huge fan of the GUI for manually managing and reviewing bookmarks. Being a command line junkie, I’d find it ideal to execute a single command that dumps a list of recently added bookmarks to a file (formatted in Markdown), so I can quickly remove junk bookmarks from the post and add a bit of commentary on why each link was noteworthy to me.

Quick googling

I know the path I want to take: this sort of deadline demands a language that’s fast to work in. I’m not designing for other users – at this point it’s just a proof of concept I wanted to prove to myself.

So I searched for how to approach this. It looks like Firefox uses a places.sqlite file to track user data like this. I recalled that sqlite3 is part of Python’s standard library (from a compile error I experienced a while ago), which made me wonder if it’d be possible to query it.
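It is indeed in the standard library – a quick sanity check that needs no installs at all:

```python
import sqlite3

# An in-memory database is enough to confirm the module ships with Python
con = sqlite3.connect(":memory:")
print(sqlite3.sqlite_version)  # version of the bundled SQLite engine
print(con.execute("SELECT 1").fetchone())
# → (1,)
```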

I had no idea where to find this file though. Running some fd commands in ~ revealed some results after a bit of trial and error. After a few attempts, I printed the directories in ~ where *.sqlite files were stored:

$ fd -HI --extension 'sqlite' | cut -d '/' -f 1-3 | uniq
# .dots/ipython/.ipython
# .mozilla/firefox/ikgk9jo5.default-release

Only 2 directories to look through! Using a more direct query:

$ fd -HI 'places.sqlite'
# .mozilla/firefox/ikgk9jo5.default-release/places.sqlite
# .mozilla/firefox/ikgk9jo5.default-release/places.sqlite-wal

Alrighty, we’ve got a target to experiment with!

Experimental setup and equipment

I have a central repo called scratchpad for experiments and one-off exercises that don’t really fit into their own repo. I don’t fret over best practices here; this is just for experiments.

To make sure dependencies don’t get in the way, I’ll create a new project with poetry new called “bookmarks”, then cd bookmarks; vim scrap.py, which is where I’ll try to open and query places.sqlite. I’ll run poetry shell to activate the virtual environment.

In a future post I’ll probably write more extensively about my vim/tmux/python setup, but the key vim plugin I rely on for this workflow is vim-slime.

Main reasons I use this: I’m a command line junkie and I like my vim keybindings. While I liked the idea of jupyter, I’m less enthused about leaving my terminal and keybindings behind. (Plus I prefer to keep my work out of the browser, so I can alt-tab instead of ctrl-tabbing and getting lost in the 30+ tabs I have open.)

With this setup, I can write code between # %% blocks and execute these blocks with <C-c><C-c>, which pipes the text into a neighboring tmux split with a running ipython session.

[fig1: vim with a # %% block alongside a running ipython session in a tmux split]

So let’s get writing!

From exploration to prototype

Import some stuff and try to open it (googling along the way whenever I hit something I don’t understand):

from pathlib import Path
from warnings import warn
import sqlite3 as sql
import shutil


# %% Connect to database

databases = [
    fp for fp in Path("~/.mozilla/firefox").expanduser().glob("**/*.sqlite") if fp.stem == "places"
]

if len(databases) == 0:
    raise FileNotFoundError("Couldn't find `places.sqlite`.")

database = databases[0]
if (n_databases := len(databases)) > 1:
    warn(f"Found {n_databases} `places.sqlite` files, using {str(database)}.")

con = sql.connect(database)

This works! But I encountered a file lock error when trying to .execute a basic query – probably locked by Firefox while it’s open. I’ll just quickly copy the file and see if that works:

# %% Connect to database (simple copy)

database_cp = Path("/tmp/bkmk/foo.sqlite")  # Seems like file is locked, just copy it lol
database_cp.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(database, database_cp)

con = sql.connect(database_cp)

Now we’re in business! How can I list the available tables? Some quick googling suggests this method:

# %% List tables in the database

query = "SELECT name FROM sqlite_master WHERE type='table';"
cursor = con.cursor()

res = cursor.execute(query).fetchall()

print(res)
# [('moz_origins',),
#  ('moz_places',),
#  ('moz_places_extra',),
#  ('moz_historyvisits',),
#  ('moz_historyvisits_extra',),
#  ('moz_inputhistory',),
#  ('moz_bookmarks',),
#  ('moz_bookmarks_deleted',),
#  ('moz_keywords',),
#  ('sqlite_sequence',),
#  ('moz_anno_attributes',),
#  ('moz_annos',),
#  ('moz_items_annos',),
#  ('moz_meta',),
#  ('moz_places_metadata',),
#  ('moz_places_metadata_search_queries',),
#  ('moz_previews_tombstones',),
#  ('sqlite_stat1',)]

Looks like moz_bookmarks would be worth querying:

# %% List columns in `moz_bookmarks`

query = "SELECT * FROM moz_bookmarks"
res = cursor.execute(query).description
print([r[0] for r in res])
# ['id',
#  'type',
#  'fk',
#  'parent',
#  'position',
#  'title',
#  'keyword_id',
#  'folder_type',
#  'dateAdded',
#  'lastModified',
#  'guid',
#  'syncStatus',
#  'syncChangeCounter']

# %% What's in here??
query = "SELECT * FROM moz_bookmarks"
res = cursor.execute(query).fetchall()
print(res)
# Seems like `dateAdded` is in epoch seconds?
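Checking that guess: the values are actually microseconds since the epoch (Firefox stores timestamps as PRTime). Converting one of the values that shows up later makes the scale obvious:

```python
from datetime import datetime, timezone

sample = 1708123255251000  # a `dateAdded` value pulled from the query results
# As seconds this would be millions of years in the future; as microseconds it's sane
dt = datetime.fromtimestamp(sample / 1e6, tz=timezone.utc)
print(dt.date())
# → 2024-02-16
```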

Ah, but this table doesn’t have URLs? The moz_places table seems like it might be pretty central – what’s in there?

# %% List columns in `moz_places`. Maybe there's a key to join on? Probs `guid`?

query = "SELECT * FROM moz_places"

res = cursor.execute(query)
print("---")
print(res.fetchall())
print("---")
print([r[0] for r in res.description])
# ['id',
#  'url',
#  'title',
#  'rev_host',
#  'visit_count',
#  'hidden',
#  'typed',
#  'frecency',
#  'last_visit_date',
#  'guid',
#  'foreign_count',
#  'url_hash',
#  'description',
#  'preview_image_url',
#  'site_name',
#  'origin_id',
#  'recalc_frecency',
#  'alt_frecency',
#  'recalc_alt_frecency']

The answer didn’t jump out immediately, but I managed to get it after a bit of trial and error. This vim-slime setup tightens the REPL loop, so I can get to the solution faster (even when many trials and errors are needed).

# %% How about `id`? Pretty sure this is a local table id key though

q1 = "SELECT fk FROM moz_bookmarks"
res1 = cursor.execute(q1).fetchall()
res1 = {r[0] for r in res1}

q2 = "SELECT id FROM moz_places"
res2 = cursor.execute(q2).fetchall()
res2 = {r[0] for r in res2}

print("---")
print(f"{len(res1)=}")
print(f"{len(res2)=}")
print(f"{len(res2 & res1)=}")
print("---")
# ---
# len(res1)=359
# len(res2)=8699
# len(res2 & res1)=346
# ---

# %% Bingo! Try join and query

query = """
SELECT p.url, b.title, b.dateAdded
FROM moz_places p INNER JOIN moz_bookmarks b
ON p.id = b.fk;
"""

res = cursor.execute(query)
print("---")
print(res.fetchmany(5))
# [('https://www.mozilla.org/privacy/firefox/', 'mobile', 1708123255251000),
#  ('https://blog.inkdrop.app/how-to-set-up-neovim-0-5-modern-plugins-lsp-treesitter-etc-542c3d9c9887',
#   'How To Manage Python with Pyenv and Direnv | DigitalOcean', 1678053275812000),
#  ('https://discuss.pytorch.org/t/how-to-implement-keras-layers-core-lambda-in-pytorch/5903',
#   'Programmer Interrupted: The Real Cost of Interruption and Context Switching', 1680757287482000),
#  ('https://github.com/shMorganson/dot-files/blob/a858e28d1adbc0a5a7b13d8a2600c2014ec8b376/nvim/.config/nvim/lua/plugins/highlights/custom_highlights.vim',
#   'Gentle Dive into Math Behind Convolutional Neural Networks | by Piotr Skalski | Towards Data Science', 1634531361569000),
#  ('https://muhammadraza.me/2022/data-oneliners/', 'CNN Explainer', 1634539466342000)]

Amazing! I’d like to see if I can work in base Python, but for exploration I’ll !pip install pandas in my ipython REPL for a bit of quick and dirty munging.

# %% Nice! Now use `WHERE` and chuck it all into a DataFrame

import time
from datetime import timedelta
import pandas as pd

cur_time = int(time.time())
delta = timedelta(hours=7).total_seconds()

query = f"""
SELECT p.url, b.title, b.dateAdded
FROM moz_places p INNER JOIN moz_bookmarks b
ON p.id = b.fk
WHERE b.dateAdded > {(cur_time - delta) * 1e6};
"""

res = cursor.execute(query)
res = pd.DataFrame(res.fetchall(), columns=[r[0] for r in res.description])
print("---")
print(res)
# ---
#                                                  url                                              title         dateAdded
# 0  https://www.youtube.com/watch?v=MCs5OvhV9S4&li...  David Beazley - Python Concurrency From the Gr...  1709086750127000
# 1        https://www.youtube.com/watch?v=zduSFxRajkE            Let's build the GPT Tokenizer - YouTube  1709091384964000


# %% Alrighty, take these results and sort and groupby day (just a bit of pandas munging)

from datetime import datetime
import pytz

res["dtAdded"] = res["dateAdded"].apply(
    lambda x: datetime.fromtimestamp(x / 1e6, tz=pytz.timezone("Australia/Melbourne"))
)
res = res.sort_values("dateAdded")

lines = []
lines.append("# Bookmarks from the last week")
lines.append("")

for k, v in res.groupby(res["dtAdded"].dt.date):
    lines.append(f"## {k}")
    lines.append("")
    for _, row in v.sort_values("dateAdded").iterrows():
        lines.append(f"[{row['title']}]({row['url']})")
        lines.append("")

print("---")
print("\n".join(lines))
print("---")
# ---
# # Bookmarks from the last week
#
# ## 2024-02-28
#
# [David Beazley - Python Concurrency From the Ground Up: LIVE! - PyCon 2015 - YouTube](https://www.youtube.com/watch?v=MCs5OvhV9S4&list=WL&index=2)
#
# [Let's build the GPT Tokenizer - YouTube](https://www.youtube.com/watch?v=zduSFxRajkE)
#
# ---
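One gotcha worth knowing before dropping pandas: itertools.groupby only groups consecutive items, so the rows must be sorted by the group key first or the same day can appear in multiple groups. A toy illustration (hypothetical data, not the real bookmark rows):

```python
from itertools import groupby

rows = [("a", 3), ("b", 1), ("c", 3)]  # (value, day) pairs, not sorted by day

# Without sorting, day 3 shows up twice because groupby is consecutive-only
print([k for k, _ in groupby(rows, key=lambda r: r[1])])
# → [3, 1, 3]

# Sorting by the group key first gives the pandas-style behaviour
rows.sort(key=lambda r: r[1])
print({k: [r[0] for r in g] for k, g in groupby(rows, key=lambda r: r[1])})
# → {1: ['b'], 3: ['a', 'c']}
```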

Now it looks like all the pieces are there. Let’s bring it all together in a single block, remove the pandas dependency, and make it executable with a single function main():

# %% Sick! That prototype works. Now let's get it working in an E2E block without requiring pandas

import argparse
import shutil
import sqlite3
import tempfile
from datetime import date, datetime, timedelta
from itertools import groupby
from pathlib import Path
from time import time
from typing import Dict, List, Tuple
from warnings import warn


def parse() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("firefox_dir", nargs="?", default="~/.mozilla/firefox")
    return parser.parse_args()


def get_database(firefox_dir: Path | str) -> Path:
    firefox_dir = Path(firefox_dir).expanduser()
    databases = [
        fp
        for fp in firefox_dir.glob("**/*.sqlite")
        if fp.stem == "places"
    ]

    if len(databases) == 0:
        raise FileNotFoundError("Couldn't find `places.sqlite`.")

    database = databases[0]
    if (n_databases := len(databases)) > 1:
        warn(f"Found {n_databases} `places.sqlite` files, using {str(database)}.")

    return database


def query_database(database: Path, delta: float | int) -> List[Tuple[str, str, int]]:
    cur_time = time()
    with sqlite3.connect(database) as con:
        cursor = con.cursor()

        query = f"""
        SELECT p.url, b.title, b.dateAdded
        FROM moz_places p INNER JOIN moz_bookmarks b
        ON p.id = b.fk
        WHERE b.dateAdded > {(cur_time - delta) * 1e6};
        """

        return cursor.execute(query).fetchall()


def format_results(results: List[Tuple[str, str, int]]) -> List[str]:
    results = sorted(results, key=lambda x: x[2])  # `groupby` only groups consecutive items
    grouped_results: Dict[date, List[Tuple[str, str, int]]] = {
        k: list(v)
        for k, v in groupby(results, key=lambda x: datetime.fromtimestamp(x[2] / 1e6).date())
    }

    lines = []
    lines.append("# Bookmarks from the last week")
    lines.append("")

    for day in sorted(grouped_results.keys()):
        lines.append(f"## {day}")
        lines.append("")

        for url, title, _ in grouped_results[day]:
            lines.append(f"[{title}]({url})")
            lines.append("")

    return lines


def main():
    args = parse()
    database = get_database(args.firefox_dir)
    with tempfile.TemporaryDirectory() as tmpdirname:
        database_cp = Path(tmpdirname) / "places.sqlite"
        shutil.copy(database, database_cp)
        results = query_database(database_cp, delta=timedelta(days=7).total_seconds())

    lines = format_results(results)
    print("---")
    print("\n".join(lines))
    print("---")


main()

And there we have it! There were a few bugs to weed out in the 20-ish minutes of writing and debugging, but starting with def main(), sketching the steps as comments, and turning each comment into its own function helped add the basic structure needed to keep this comprehensible.

And so, this final block (with some minor edits) was good enough to lump into bookmarks/__main__.py. I extended the basic argparse setup to allow specifying a different Firefox directory (in case it’s not at ~/.mozilla/firefox) and adjusting delta to a custom number of hours and days.
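The exact flag names in foxtail may differ, but that argument parsing looks something like this sketch (--days and --hours are my guesses at the names):

```python
import argparse
from datetime import timedelta


def parse(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="foxtail")
    # `nargs="?"` makes the directory optional so the default kicks in
    parser.add_argument("firefox_dir", nargs="?", default="~/.mozilla/firefox")
    parser.add_argument("--days", type=int, default=7)   # hypothetical flag name
    parser.add_argument("--hours", type=int, default=0)  # hypothetical flag name
    return parser.parse_args(argv)


args = parse(["--days", "1", "--hours", "12"])
delta = timedelta(days=args.days, hours=args.hours).total_seconds()
print(int(delta))
# → 129600
```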

And finally, to spice up the name, I consulted ChatGPT for a few suggestions, and foxtail was a pretty nice one! So I renamed the project, copied this __main__.py script to a new repo, added a README, and uploaded it all to BlakeJC94/foxtail.

Conclusion

This was a pretty nice exercise, and I thought it would be worthwhile to document how I approached a small-scope project for my own review. Hopefully there will be some new content here soon – some of the bookmarks I summarise may even branch out into their own posts!
