Statistical Inference · R · SQL

Decoding Spotify Popularity

In the music industry, featured artists are considered a cheat code for virality. But does adding more artists actually guarantee more streams? I built an end-to-end statistical pipeline to find out.

View Raw Code on GitHub

The Context

Record labels and independent musicians frequently invest heavy budgets in collaborative tracks, operating on the assumption that combining artist fanbases all but guarantees greater reach.

The goal of this project was to move past industry intuition and apply rigorous statistical testing to real Spotify data. Specifically, I wanted to answer these fundamental questions:

1. Stream Volume

Do collaborative tracks generate a statistically higher mean stream count than tracks released by a single artist?

2. Chart Probability

Are collaborative tracks statistically more likely to break into the Spotify Global Charts?

3. Statistical Power

Assuming a true 10% difference in charting rates, does the dataset possess enough statistical power to detect this effect?

Data Engineering & The Pipeline

Analytical models are only as reliable as the data fed into them. The raw dataset of 952 tracks contained formatting issues that broke standard numeric operations.

  • SQL Profiling: Used SQLite (sqlite_master, LIMIT queries) to rapidly profile the schema, validate total row counts, and understand the raw data structure before loading it into memory.
  • Aggressive Coercion (R): The streams column was stored as character strings. I cast it to numeric, suppressed the expected coercion warnings, and then identified and dropped the resulting NA rows without touching valid records.
  • Feature Engineering: Created a binary charted flag and a categorical group segment based on artist counts.
  • Logarithmic Transformation: Streaming data suffers from extreme positive skew (a few tracks have billions of streams, most have thousands). I engineered a log_streams feature to normalize the distribution, making it suitable for a t-test.
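The pipeline steps above can be sketched in R. The file name and column names (`streams`, `artist_count`, `in_spotify_charts`) are assumptions based on the dataset described here, not the project's actual code:

```r
library(dplyr)

# Load the raw CSV; streams arrives as character strings (illustrative file name)
tracks <- read.csv("spotify_tracks.csv", stringsAsFactors = FALSE)

tracks_clean <- tracks %>%
  # Coerce streams to numeric; malformed values become NA (coercion warnings expected)
  mutate(streams = suppressWarnings(as.numeric(streams))) %>%
  filter(!is.na(streams)) %>%
  mutate(
    charted     = as.integer(in_spotify_charts > 0),            # binary chart flag
    group       = if_else(artist_count > 1, "Collab", "Solo"),  # segment by artist count
    log_streams = log(streams)                                  # tame the extreme right skew
  )
```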

Finding 1: Log-Streams

To evaluate whether collaborations drive a higher volume of streams, I ran a one-sided Welch two-sample t-test on the engineered log_streams metric.

The results contradicted industry assumptions. The mean log-streams for collaborations (19.29) was actually slightly lower than for solo tracks (19.64).

H0: μ_collab - μ_solo ≤ 0
H1: μ_collab - μ_solo > 0

Result: Fail to reject H0 (p ≈ 1)

There is no statistical evidence that collaborative songs yield a higher average stream count than solo tracks.
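In R, the one-sided Welch test described above looks roughly like this; the data frame and column names (`tracks_clean`, `group`, `log_streams`) are illustrative assumptions:

```r
collab <- tracks_clean$log_streams[tracks_clean$group == "Collab"]
solo   <- tracks_clean$log_streams[tracks_clean$group == "Solo"]

# Welch variant (unequal variances); H1: mean(collab) > mean(solo)
t.test(collab, solo, alternative = "greater", var.equal = FALSE)
```

Because the observed collab mean sits below the solo mean, a one-sided test in this direction necessarily returns a p-value near 1.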

Boxplot of Log Streams by Group
[Visualization: Boxplot comparing Log Streams between Solo and Collab groups]
Visualizing the variance and mean of log(streams) by group.

Finding 2: Charting Probability

While absolute stream counts didn't favor collaborations, I hypothesized they might still act as a viral catalyst, pushing songs onto the Spotify charts faster. To test this, I ran a Two-Proportion Z-Test.

Collab chart rate: 59.6% (218 of 366 tracks)

Solo chart rate: 56.3% (330 of 586 tracks)

Despite a visible 3.2-percentage-point increase in charting frequency for collaborations, the test returned a p-value of 0.179, leading us to fail to reject H0 once again.

Result: Fail to reject H0

The observed increase is not statistically significant. We cannot confidently claim that collaborations improve the likelihood of a song charting.
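The counts reported above are enough to reproduce the comparison. In R, `prop.test` is the standard two-proportion test (it applies a continuity correction by default, so its p-value can differ slightly from a hand-computed z-statistic):

```r
charted <- c(218, 330)   # charting tracks: collab, solo
totals  <- c(366, 586)   # group sizes

# H1: collab charting rate > solo charting rate
prop.test(charted, totals, alternative = "greater")
```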

Finding 3: Statistical Power

A crucial step in rigorous analytics is acknowledging the limits of a model. To validate my findings, I ran a statistical power analysis (power.prop.test) to determine the sample size needed to detect a true 10-percentage-point difference in charting rates with 80% power.

The test revealed that 388 rows per group were required. Because my collaboration group contained only 366 valid rows, the dataset was marginally underpowered to confidently detect that effect size.
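The reported 388-per-group figure is consistent with a 50% baseline charting rate; that baseline is an assumption on my part, since the write-up does not state the proportions fed into the calculation:

```r
# Detect a 10-percentage-point lift over an assumed 50% baseline,
# with 80% power at the conventional 5% significance level
power.prop.test(p1 = 0.50, p2 = 0.60, power = 0.80, sig.level = 0.05)
# n comes back as roughly 387.3, which rounds up to 388 tracks per group
```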

Limitations & What I Learned

  • Data Cleaning is Paramount: A single improperly formatted string can break an entire statistical pipeline. Aggressive, precise cleaning was the most critical step of this project.
  • Data > Intuition: The music industry heavily favors collaborations, but the data showed solo tracks holding their own. Real-world data often contradicts widely accepted assumptions.
  • Causation vs. Correlation: This was an observational dataset. While we found no significant popularity boost from collaborations, establishing causation would require a randomized controlled experiment.