I have a personal project I just undertook to "mimic" Stathead Baseball (https://www.sports-reference.com/stathead/sport/baseball/) which is an awesome site. I created a local DuckDB database as the backend and plan to create a front end to query the DB as I have searching through historical stats. I had to pull data from three different sources:
Lahman Baseball Database (https://sabr.org/lahman-database/)
Downloaded the CSV files
Retrosheet game logs (https://www.retrosheet.org/gamelogs/index.html)
Chadwick Baseball Bureau Persons Register (https://github.com/chadwickbureau/register)
I associated the search categories to the datasets (screenshot). I also attached the schema as well. The biggest challenge in this project is that Lahman and Retrosheet use different Player IDs. For example, a player might be bondsb01 in Lahman but bondb001 in Retrosheet. I thought the solution was to use DuckDB to join these datasets using a crosswalk or register table like the one provided by the Chadwick Bureau so downloaded those files, but ran into an issue there as the Chadwick registry lacked a direct key_lahman column in the specific export, so pivoted to the native stg_lahman.People table as the definitive cross-reference hub. For the master view I constructed v_player_master, which is a high-performance view that consolidates player metadata, name variants, and external ID mappings (MLBAM, BBRef, Retrosheet). This view serves as the primary join-point for all cross-dataset analytical queries. With all of that said, I know enough to know that I don't know everything and would love to see if someone could review the DB to point out the good, the bad, and the ugly. Was considering sites like Fiverr or Upwork and would appreciate any feedback. Thx
.png)