Want to learn pandas but don't know where to start? That was my position about a week ago. In this post, I'll explain how I structured my learning process during my one-week 'crash course.' I'm by no means an expert now, but I feel confident saying I can accomplish essential data cleaning and visualization tasks.
You can find the git repository and associated Jupyter notebooks here.
My goal for this project was to get some hands-on experience manipulating, cleaning, and visualizing a random data set that I found online. I chose the Denver Neighborhood Demographics dataset (from the 2010 census) because it was relatively small, which made it less intimidating for a beginner. I'm also interested in moving to Denver, so I figured learning more about the city would be worthwhile, even if the data was older.
You can find the dataset here. I also used this tutorial about Exploratory data analysis as a rough guide for the project.
Getting the data into pandas was relatively simple: I imported pandas and read the data in from a CSV file I had downloaded to my laptop. I also made sure that all columns would be displayed when I ran df.head() so I could get a sense of what information was included.
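Here is a minimal sketch of what that loading step might look like. The column names and values below are invented stand-ins (the real dataset's columns will differ), and the CSV is read from an in-memory string so the example runs on its own; with a downloaded file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV; names and numbers are made up
csv_text = """NBHD_NAME,POPULATION_2010,MALE,FEMALE
Alpha,1000,480,520
Beta,2500,1300,1200
"""

# Show every column when printing a wide DataFrame instead of truncating with "..."
pd.set_option("display.max_columns", None)

# With a real file this would be something like pd.read_csv("path/to/your_file.csv")
df = pd.read_csv(io.StringIO(csv_text))

# Peek at the first rows to see what information is included
print(df.head())
```

Setting display.max_columns to None only affects how DataFrames are printed; it does not change the data itself.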
Next, I perused the data to see what information I was interested in modeling. I looked at some basic summary statistics using df.describe(), checked for null values and duplicate rows, and confirmed that all the data types were appropriate.
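The checks described above could be sketched roughly like this. The small DataFrame is a made-up stand-in for the real dataset, and its column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in data; the real dataset has many more rows and columns
df = pd.DataFrame(
    {
        "NBHD_NAME": ["Alpha", "Beta", "Gamma"],
        "POPULATION_2010": [1000, 2500, 1750],
    }
)

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
summary = df.describe()
print(summary)

# Number of missing values in each column
null_counts = df.isnull().sum()
print(null_counts)

# Number of rows that exactly duplicate an earlier row
duplicate_count = df.duplicated().sum()
print(duplicate_count)

# Data type of each column, to confirm numbers were not read in as text
print(df.dtypes)
```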
Fortunately there were no duplicate rows, no null values, and all of the datatypes were suitable for the information they held. Cleaning simply involved dropping unnecessary columns and renaming the remaining ones to be more reader-friendly.
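Dropping and renaming columns might look something like the sketch below. Every column name here is hypothetical; the idea is simply to pass the unwanted columns to drop and a mapping of old names to new names to rename.

```python
import pandas as pd

# Hypothetical stand-in data with a couple of columns we do not need
df = pd.DataFrame(
    {
        "NBHD_ID": [1, 2],
        "NBHD_NAME": ["Alpha", "Beta"],
        "POPULATION_2010": [1000, 2500],
        "SHAPE_AREA": [12.5, 30.1],
    }
)

# Drop the unnecessary columns, then give the rest reader-friendly names
cleaned = df.drop(columns=["NBHD_ID", "SHAPE_AREA"]).rename(
    columns={"NBHD_NAME": "Neighborhood", "POPULATION_2010": "Population"}
)

print(cleaned.head())
```

Both drop and rename return a new DataFrame by default, so the original df is left untouched.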
At this point I began to create graphs using the data. I decided to explore the total population distribution, racial makeup, and gender distribution of Denver's neighborhoods. I played around with creating several different graph types and overlaying two charts on top of each other.
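As one illustration of overlaying two charts, the sketch below draws a bar chart of population and a line for the percentage of male residents on a second y-axis. It uses matplotlib directly with made-up numbers and hypothetical column names; the charts in the notebooks may have been built differently.

```python
import matplotlib

# Non-interactive backend so the sketch runs without a display;
# drop this line in Jupyter, where figures render inline
matplotlib.use("Agg")

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in data
df = pd.DataFrame(
    {
        "Neighborhood": ["Alpha", "Beta", "Gamma"],
        "Population": [1000, 2500, 1750],
        "Male": [480, 1300, 875],
    }
)
df["Percent Male"] = df["Male"] / df["Population"] * 100

fig, ax = plt.subplots(figsize=(8, 4))

# First chart: total population per neighborhood as bars
ax.bar(df["Neighborhood"], df["Population"], color="steelblue")
ax.set_ylabel("Population")

# Second chart: percent male as a line, overlaid on a twin y-axis
ax2 = ax.twinx()
ax2.plot(df["Neighborhood"], df["Percent Male"], color="darkorange", marker="o")
ax2.set_ylabel("Percent male")

ax.set_title("Population and percent male by neighborhood")
fig.tight_layout()
```

The twin axis lets the two series share the same x-axis while keeping separate y-scales, which matters when one series is a count and the other is a percentage.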
My desire to learn Pandas comes from my interest in AI and Machine Learning. When I worked through Harvard's Intro to AI course on edX, they specifically said not to use Pandas and NumPy if you weren't already familiar with them. This led to me feeling a bit behind the curve when I looked into pursuing my own ML projects after I finished the course. I now understand why Pandas is so popular - the ease with which you can clean and analyze data is awesome!
I also find data analytics fascinating in general. One of the things I love about coding is that you essentially have an army of robots ready to do your bidding. I especially see this with data analytics. Hand-plotting those charts or running each of those calculations through a calculator would have taken ages. That's a bit of a rustic example, but it shows how cool the world is today.
I hope to do a more in-depth project using Pandas in a few weeks. Next time, though, I will set out with a clear goal in mind, whereas this time my goal was merely to 'mess around.' Learning Pandas was fun and I can't wait to use my new skills, but I found messing around with no purpose to be rather boring. Perhaps my next project will involve using Pandas to format data for some sort of ML task. We'll see! Today I plan to redo my development environment and start learning how to web scrape.