Ryan Malesevich

amateur runner, technology enthusiast, and friend to all dogs

RunningAHEAD Analyzer

18 November 2024

Last month I analyzed and wrote about my run streaks. I’m a bit ashamed to admit that I did the analysis in Excel. Excel has its uses, but I’m a data engineer and I really should have something more capable that I can rerun whenever my heart desires. My running log of choice, RunningAHEAD, doesn’t really have an API that I could tap into, so I am left with its log export functionality, which produces a tab-delimited text file. In addition, I’ve been playing around with DuckDB, which natively supports a lot of file formats and lets you analyze the data through SQL. This got me thinking, and I spent a little time over the last few days building the initial version of my RunningAHEAD Analyzer. It’s by no means complete. Today it consists of a Jupyter notebook that I’ll outline in this post.

Today it goes all the way from downloading the export to producing a report of my streaks longer than 10 days, ordered by longest streak. The output looks like this:

┌─────────────┬───────────────────┬───────────────────┬────────────────┬────────────┐
│ StreakGroup │ min_day_of_streak │ max_day_of_streak │ total_distance │ total_days │
│   int128    │       date        │       date        │     double     │   int64    │
├─────────────┼───────────────────┼───────────────────┼────────────────┼────────────┤
│         134 │ 2011-08-31        │ 2012-04-26        │        1089.79 │        240 │
│         858 │ 2024-08-19        │ 2024-11-18        │         372.22 │         92 │
│         406 │ 2017-08-27        │ 2017-11-25        │         375.12 │         91 │
│         470 │ 2018-11-12        │ 2019-02-08        │         338.14 │         89 │
│         249 │ 2013-09-21        │ 2013-10-09        │          97.47 │         19 │
│         596 │ 2020-10-18        │ 2020-11-05        │         135.17 │         19 │
│         857 │ 2024-07-29        │ 2024-08-14        │          55.78 │         17 │
│         294 │ 2014-06-29        │ 2014-07-14        │           59.7 │         16 │
│         471 │ 2019-02-11        │ 2019-02-26        │          73.33 │         16 │
│         317 │ 2015-03-30        │ 2015-04-12        │          39.08 │         14 │
│         586 │ 2020-07-31        │ 2020-08-13        │          88.26 │         14 │
│         223 │ 2013-05-04        │ 2013-05-16        │           94.8 │         13 │
│         605 │ 2021-01-05        │ 2021-01-17        │         100.07 │         13 │
│         614 │ 2021-03-11        │ 2021-03-23        │          85.11 │         13 │
│         216 │ 2013-04-01        │ 2013-04-11        │          57.22 │         11 │
│         604 │ 2020-12-24        │ 2021-01-03        │          67.45 │         11 │
├─────────────┴───────────────────┴───────────────────┴────────────────┴────────────┤
│ 16 rows                                                                 5 columns │
└───────────────────────────────────────────────────────────────────────────────────┘

The timing is impeccable, as my current streak just became my second-longest streak of all time. I’m pretty happy about that!

So, how does it work? Someday I’ll clean up the code and write a proper README.md for the repository, but until then this post serves that purpose. I’ll document the various blocks of code below.

In this first block of code I’m relying on Selenium with Python. Selenium is a framework that automates web browser tasks through a webdriver; in this case I’m using Chrome because I couldn’t get Safari to work. The script loads the RunningAHEAD website, finds the login form, logs in, navigates to the Export page, and finally downloads the ZIP file with the log. Rather than hardcoding my username, password, and other details, I put them in a .env file.

import os
import time
from dotenv import load_dotenv  # python-dotenv reads the .env file into the environment
from selenium import webdriver
from selenium.webdriver.common.by import By

load_dotenv()

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)

driver.get("https://www.runningahead.com/")

driver.find_element(By.ID, "ctl00_ctl00_ctl00_SiteContent_PageContent_MainContent_email").send_keys(os.getenv("RA_USER"))
driver.find_element(By.ID, "ctl00_ctl00_ctl00_SiteContent_PageContent_MainContent_password").send_keys(os.getenv("RA_PASS"))
driver.find_element(By.ID, "ctl00_ctl00_ctl00_SiteContent_PageContent_MainContent_login_s").click()

driver.get("https://www.runningahead.com/logs/" + os.getenv("RA_ID") + "/tools/export")

driver.find_element(By.ID, "ctl00_ctl00_ctl00_SiteContent_PageContent_TrainingLogContent_Download_s").click()

time.sleep(10)  # crude fixed wait for the download to finish

driver.quit()
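The fixed time.sleep(10) works, but it is fragile: a slow connection can outlast it. A sturdier approach (a sketch; the wait_for_download helper and its timeout are my own naming, not part of the notebook) polls the download folder until the ZIP appears and Chrome’s .crdownload temp file is gone:

```python
import time
from pathlib import Path

def wait_for_download(folder, filename, timeout=60.0):
    """Poll `folder` until `filename` exists and no Chrome .crdownload
    temp file remains, or raise TimeoutError after `timeout` seconds."""
    target = Path(folder) / filename
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        in_progress = list(Path(folder).glob("*.crdownload"))
        if target.exists() and not in_progress:
            return target
        time.sleep(0.5)
    raise TimeoutError(f"{filename} did not finish downloading in {timeout}s")
```

With that in place, the call would look something like wait_for_download(os.getenv("RA_FOLDER"), os.getenv("RA_NAME") + ".tab.zip") instead of the fixed sleep.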

After the ZIP file is downloaded, this block of code extracts it. The log.txt file was the biggest pain in the ass, as DuckDB could not read it properly. After hours of troubleshooting I discovered that the header incorrectly has one extra tab. I struggled a lot with this, so I just hardcoded a corrected header line. This section is a good candidate to refactor in the future.

import zipfile

expected_file = os.getenv("RA_FOLDER") + os.getenv("RA_NAME") + ".tab.zip"
with zipfile.ZipFile(expected_file, 'r') as zip_ref:
    zip_ref.extractall(os.getenv("RA_FOLDER") + "runningahead-logs/")

log_file = os.getenv("RA_FOLDER") + "runningahead-logs/log.txt"

with open(log_file, "r") as f:
    lines = f.readlines()

# The original file has an extra tab that messes up everything
new_header = "Date	TimeOfDay	Type	SubType	Distance	DistanceUnit	Duration	Weight	WeightUnit	RestHR	AvgHR	MaxHR	Sleep	Calories	Quality	Effort	Weather	Temperature	TempUnit	Notes	Course	CourseSurface	CourseNotes	ShoeMake	ShoeModel	Size	System	ShoeSerial	ShoePrice	OverallPlace	FieldSize	GroupMinAge	GroupMaxAge	GroupPlace	GroupSize	GenderPlace	GenderSize\n"

with open(log_file, "w") as f:
    f.write(new_header)

    # skip the original (broken) header; readlines() keeps each line's newline
    for line in lines[1:]:
        f.write(line)
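One way to sanity-check the repaired file before handing it to DuckDB (a sketch; check_column_counts is my own helper, and 37 is the field count of the hardcoded header above):

```python
import csv

def check_column_counts(path, expected):
    """Return the 1-based line numbers of rows whose tab-separated
    field count differs from the expected column count."""
    bad = []
    with open(path, newline="") as f:
        for i, row in enumerate(csv.reader(f, delimiter="\t"), start=1):
            if row and len(row) != expected:
                bad.append(i)
    return bad
```

Something like assert not check_column_counts(log_file, 37) right after the rewrite would catch a future export format change early instead of producing a confusing DuckDB error.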

Finally, I have the CSV data and I’m ready to start the analysis by firing up DuckDB and loading it. I’m not persisting any data at this point, so I’m using DuckDB’s in-memory database.

import duckdb

connection = duckdb.connect(database=':memory:')

log = connection.read_csv(log_file)

The fun thing about using DuckDB in Python is that when you assign the output of a SQL call to a variable, you can use it as a table later in the code. The data contains different units of measure, so I create a pseudo-table with the conversion factors to get everything into miles.

# Create a conversion table so everything can be converted through SQL
conversion = connection.sql("""
    SELECT 'Mile' AS Unit, 1 AS ConversionFactor
    UNION
    SELECT 'Kilometer' AS Unit, 0.6213712 AS ConversionFactor
    UNION
    SELECT 'Meter' AS Unit, 0.0006213712 AS ConversionFactor
""")
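The same factors are easy to sanity-check with plain Python arithmetic (just an illustration of the pseudo-table; the notebook does the conversion in SQL):

```python
# Conversion factors to miles, mirroring the DuckDB pseudo-table
CONVERSION = {
    "Mile": 1,
    "Kilometer": 0.6213712,
    "Meter": 0.0006213712,
}

def to_miles(distance, unit):
    """Convert a distance in the given unit to miles."""
    return distance * CONVERSION[unit]
```

For example, a 5K works out to about 3.11 miles, which matches what any race calculator will tell you.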

Now the fun… lots of common table expressions and window functions. The first step converts the distance to miles and filters out any non-running activities. After that, I build an aggregated dataset by day, since I regularly do multiple run workouts in a day. Then I use a LAG function so that for each day I can check whether the previous row is the previous calendar day and whether I ran at least 1 mile; if not, StreakBreak is set. That feeds into a grouped log where I create a StreakGroup identifier from the running total of StreakBreak by Date. Finally, I add a ROW_NUMBER() partitioned by StreakGroup to number each day within its streak. It can get a little wild here!

streak_log = connection.sql("""
    WITH convert_to_miles AS (
        SELECT
            l.Date
            , l.Distance * c.ConversionFactor AS DistanceMiles
        FROM log AS l
        INNER JOIN conversion AS c ON l.DistanceUnit = c.Unit
        WHERE Type = 'Run'
    ), aggregated_log AS ( 
        SELECT
            Date
            , SUM(DistanceMiles) AS Distance
        FROM convert_to_miles
        GROUP BY
            Date
    ), streak_groups AS (
        SELECT
            Date
            , Distance
            , CASE
                WHEN Date - 1 = LAG(Date) OVER (ORDER BY Date) AND Distance >= 1 THEN 0
                ELSE 1
            END AS StreakBreak
        FROM aggregated_log
    ), grouped_log AS (
        SELECT
            Date
            , Distance
            , SUM(StreakBreak) OVER (ORDER BY Date) AS StreakGroup
        FROM streak_groups
    )
    SELECT
        Date
        , Distance
        , StreakGroup
        , ROW_NUMBER() OVER (PARTITION BY StreakGroup ORDER BY Date) AS Streak
    FROM grouped_log
    ORDER BY Date;
""")
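This is the classic “gaps and islands” trick, and it can be hard to follow in SQL. Here is the same logic sketched in plain Python (my own helper name; the notebook does all of this in DuckDB):

```python
from datetime import date, timedelta

def streak_groups(daily):
    """Assign a streak-group id to each (date, miles) row, mirroring the
    LAG + cumulative SUM trick: a new group starts whenever the previous
    day is missing or the day's distance is under a mile."""
    groups, group_id, prev = [], 0, None
    for day, miles in sorted(daily):
        # StreakBreak: previous row is not yesterday, or distance < 1 mile
        if prev is None or day - prev != timedelta(days=1) or miles < 1:
            group_id += 1
        groups.append((day, miles, group_id))
        prev = day
    return groups
```

Consecutive qualifying days share a group id, and any gap (or sub-mile day) bumps the id, exactly like the running SUM over StreakBreak in the query.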

To get the output I want, I aggregate that dataset with some MIN/MAX values:

connection.sql("""
    SELECT
        StreakGroup
        , MIN(Date) AS min_day_of_streak
        , MAX(Date) AS max_day_of_streak
        , ROUND(SUM(Distance), 2) AS total_distance
        , MAX(Streak) AS total_days
    FROM streak_log
    GROUP BY
        StreakGroup
    HAVING total_days > 10
    ORDER BY total_days DESC, min_day_of_streak ASC
""")

Last, but not least, I clean up the files that were downloaded so they don’t interfere with me running the notebook again:

import shutil

# clean up the ZIP file
os.remove(expected_file)

# clean up the extracted files
files = os.getenv("RA_FOLDER") + "runningahead-logs/"
shutil.rmtree(files)
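Since shutil.rmtree is unforgiving, a slightly more defensive version of this cleanup (a sketch; the guard against an unset RA_FOLDER and the cleanup name are my own additions) might look like:

```python
import os
import shutil

def cleanup(folder, zip_name):
    """Remove the downloaded ZIP and the extracted folder, tolerating a
    partial or already-cleaned state, but refusing to run without a folder."""
    if not folder:
        raise ValueError("RA_FOLDER is not set; refusing to delete anything")
    zip_path = os.path.join(folder, zip_name)
    if os.path.exists(zip_path):
        os.remove(zip_path)
    shutil.rmtree(os.path.join(folder, "runningahead-logs"), ignore_errors=True)
```

The check matters because os.getenv returns None when a variable is missing, and concatenating None into a path is the kind of error you only want to hit once.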

It works. Not as well as I’d like, but it works. I’ll continue working and will make a post when it’s ready to be used. For now, I’d recommend checking out the repository on GitHub for further updates.

running data engineering python duckdb