Google Music Playlist to CSV Using Pandas
Table of Contents
Recently I wanted to save the songs in my Google Music playlists as text files, but I couldn’t find a way to do it using the available menus in Google Music. It then occurred to me that I should be able to scrap a playlist web page to extract the data from a table element. To add a challenge to this task, I decided to use the Python pandas library to do it. As it turned out, it was a great decision that made the data clean up process very simple.
I start this post with the final outcome, the Python script songs.py that takes a Google music playlist file in HTML format as input and produces a text CSV file as output. Here is the command line on my terminal window:
./songs.py sr.html
saved file sr.csv with 156 songs
The input file sr.html
contains the 156 songs from my Solid Rock playlist. I
created this file by first visiting the playlist Google web page and then
saving the page to my local hard drive. The playlist is small, but long enough
to illustrate the conversion process from HTML to CSV formats.
Web Scraping #
My first attempt to scrap the playlist’s web page with pandas was this line:
In [1]: import pandas as pd
In [2]: table = pd.read_html('https://play.google.com/music/listen...') # long URL here!
Out [2]: ...
ValueError: No tables found
The Google web page is complex, with lots of JavaScript code and links to
dynamic content loaders, but not a single HTML table is present. However, when I
save the page in the Web Page, complete
format, which is the default in the
Firefox browser, I get one HTML file (sr.html
) and a directory full of
images, scripts, CSS files, and many other elements. As it happens, this single
HTML file does have a table with the songs:
In [1]: import pandas as pd
In [2]: table = pd.read_html('sr.html')
In [3]: len(table)
Out[3]: 1
In [4]: len(table[0])
Out[4]: 41
The pandas read_html
method scans the input and returns a list of DataFrame
objects, one per each HTML table it finds. Line 3 above tells me that pandas
found one table element and therefore there is one DataFrame in the list. Line 4
queries the number of rows in the DataFrame and finds only 41, not the 156 in
the playlist.
Here is what happens. The content of the page is generated dynamically but the output includes only enough songs to fill in the browser’s page, and a bit more, the exact number is unknown. The taller the display area becomes, for instance by reducing the page zoom, the higher the number of songs that are captured in the file. In my case I can capture 92 songs in a single file-save operation when the zoom is at its allowed minimum of 30% (the text is very small to read at this level). As I scroll past the first page, a new set of songs is generated which I can capture in a second file. Here are two screen shots side by side of the web page, 90% zoom on the left and 30% zoom on the right.
I decided then to minimize the zoom, save the HTML content as a file, press
Page Down
, save the content again, and so on. The alternative was to look into
the JavaScript code and to attempt to do some magic to save the whole playlist
at once. But I stayed with the manual process because, first, my playlists are
not very long, and second, because I don’t want to learn JavaScript ☺.
I needed to save three files, sr{1,2,3}.html
, to capture the 156 songs in my
Solid Rock playlist. I then concatenated them into a single file and used the
latter as the input to the songs.py
script, as follows:
cat sr1.html sr2.html sr3.html > sr.html
./songs.py sr.html
saved file sr.csv with 156 songs
As a side note, you can concatenate the songs using pandas instead:
In [1]: import pandas as pd
In [2]: table1 = pd.read_html('sr1.html')
In [3]: table2 = pd.read_html('sr2.html')
In [4]: table3 = pd.read_html('sr3.html')
In [5]: table = pd.concat(table1 + table2 + table3)
Concatenating the three files, or equivalently, concatenating the DataFrames in pandas, results in a number of duplicate entries because some songs overlap in each captured file. This and other elements in the resulting DataFrame need to be addressed as part of a data clean up process.
Data Clean Up #
The first step after loading the HTML file is to look for information about the DataFrame of songs:
In [1]: import pandas as pd
In [2]: dfs = pd.read_html('sr.html')
In [3]: songs = pd.concat(dfs)
In [4]: songs.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 190 entries, 0 to 36
Data columns (total 7 columns):
# 182 non-null float64
Name 182 non-null object
Unnamed: 2 182 non-null object
Artist 182 non-null object
Album 182 non-null object
Unnamed: 5 180 non-null float64
Unnamed: 6 0 non-null float64
dtypes: float64(3), object(4)
memory usage: 11.9+ KB
There are 190 entries, including duplicates and invalid lines. There are seven columns of which I am interested in only three: Name (song name), Artist, and Album. The other four are # (song number), Unnamed: 2 (song duration), Unnamed: 5 (how many times a song has been played), and Unnamed: 6 (a thumbs up/down flag).
Let’s drop the unwanted columns and all rows with invalid (NaN) content (there are several of them):
In [5]: songs.drop(['Unnamed: 2', 'Unnamed: 5', 'Unnamed: 6'], axis=1, inplace=True)
In [6]: songs.dropna(inplace=True)
In [7]: songs.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 182 entries, 0 to 36
Data columns (total 4 columns):
# 182 non-null float64
Name 182 non-null object
Artist 182 non-null object
Album 182 non-null object
dtypes: float64(1), object(3)
memory usage: 7.1+ KB
Next is to eliminate duplicates, which are songs with the same value in the # column, and to drop that column which I don’t need anymore:
In [8]: songs.drop_duplicates('#', inplace=True)
In [9]: songs.drop('#', inplace=True, axis=1)
In [10]: songs.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 156 entries, 0 to 36
Data columns (total 3 columns):
Name 156 non-null object
Artist 156 non-null object
Album 156 non-null object
dtypes: object(3)
memory usage: 4.9+ KB
Excellent, now I have the right number of songs in the DataFrame, 156. What is left is to sort the entries by Artist → Album → Song name, re-index the dataset, and rearrange the columns in that order:
In [11]: songs.sort_values(by=['Artist', 'Album', 'Name'], inplace=True)
In [12]: songs.reset_index(inplace=True)
In [13]: songs = songs.reindex(columns=['Artist', 'Album', 'Name'])
Here are the first ten entries in the dataset:
In [14]: songs.head(10)
Out[14]:
Artist Album Name
0 3 Doors Down 3 Doors Down It's Not My Time
1 3 Doors Down Away From The Sun Here Without You
2 3 Doors Down The Better Life Be Like That
3 3 Doors Down The Better Life Kryptonite
4 3 Doors Down The Greatest Hits When I'm Gone
5 3 Doors Down Time Of My Life When You're Young
6 Against Me! Shape Shift With Me Crash
7 Alter Bridge Blackbird Rise Today
8 Audioslave Audioslave Like a Stone
9 Audioslave Out of Exile Be Yourself
Saving the DataFrame to a CSV File #
I derive the name of the CSV file from the name of the input file by changing
the suffix .html
to .csv
:
In [15]: from pathlib import Path
In [16]: path = Path('sr.html')
In [17]: csv = path.with_suffix('.csv')
In [18]: print(csv)
sr.csv
I can then generate the CSV file as follows:
In [19]: songs.to_csv(csv)
Back now to the command line on my terminal window:
head -11 sr.csv
,Artist,Album,Name
0,3 Doors Down,3 Doors Down,It's Not My Time
1,3 Doors Down,Away From The Sun,Here Without You
2,3 Doors Down,The Better Life,Be Like That
3,3 Doors Down,The Better Life,Kryptonite
4,3 Doors Down,The Greatest Hits,When I'm Gone
5,3 Doors Down,Time Of My Life,When You're Young
6,Against Me!,Shape Shift With Me,Crash
7,Alter Bridge,Blackbird,Rise Today
8,Audioslave,Audioslave,Like a Stone
9,Audioslave,Out of Exile,Be Yourself
If you do not want the index column and the header first row on the CSV file then use the following command:
In [20]: songs.to_csv(csv, index=False, header=False))
This is the new output:
head -11 sr.csv
3 Doors Down,3 Doors Down,It's Not My Time
3 Doors Down,Away From The Sun,Here Without You
3 Doors Down,The Better Life,Be Like That
3 Doors Down,The Better Life,Kryptonite
3 Doors Down,The Greatest Hits,When I'm Gone
3 Doors Down,Time Of My Life,When You're Young
Against Me!,Shape Shift With Me,Crash
Alter Bridge,Blackbird,Rise Today
Audioslave,Audioslave,Like a Stone
Audioslave,Out of Exile,Be Yourself
Mission accomplished!