Tools: Python (Beautiful Soup, Pandas), Excel, Tableau
1. What is Web Scraping?
Web scraping is the process of extracting data from websites. It involves using software or scripts to access a website's HTML code, parse it, and extract specific information from it. This information can include text, images, links, and more, depending on the needs of the scraper. Web scraping is commonly used for various purposes, such as data collection, data analysis, research, and automation.
(Source: ChatGPT)
2. What is Beautiful Soup?
In this project, I use Beautiful Soup, a Python library for web scraping, to pull data out of HTML and XML files. From there, we can extract and clean the data into the format we want.
![](https://static.wixstatic.com/media/8e630b_0731ead307784c8c8704fb3e04cba88a~mv2.png/v1/fill/w_980,h_583,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/8e630b_0731ead307784c8c8704fb3e04cba88a~mv2.png)
Example of code using the Beautiful Soup library, source: ChatGPT
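To give a sense of how the library works, here is a minimal sketch (not the project's actual code): Beautiful Soup parses an HTML document into a tree, and methods like `find` and `find_all` pull out specific tags. The HTML snippet below is made up for illustration.

```python
from bs4 import BeautifulSoup

# A small, hypothetical HTML snippet to illustrate the basic API
html = """
<html><body>
  <h1>Largest Companies</h1>
  <table>
    <tr><th>Name</th><th>Revenue</th></tr>
    <tr><td>Walmart</td><td>611,289</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.find("h1").text                        # text of the first <h1> tag
rows = soup.find_all("tr")                          # every table row in the document
cells = [td.text for td in rows[1].find_all("td")]  # data cells of the second row

print(title)   # Largest Companies
print(cells)   # ['Walmart', '611,289']
```

Once the tags are isolated like this, the text inside them can be cleaned and loaded into a tool like Pandas.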
3. My Web Scraping Project
In this project, I chose Wikipedia as the source website and extracted data on the 100 largest companies in the US.
Here is the link: List of largest companies in the United States by revenue - Wikipedia
I used Beautiful Soup to scrape the data, then Pandas to clean it and export the final dataset to a CSV file. Finally, I connected the CSV file to Tableau to visualize the data and gain insights from it.
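The scraping step can be sketched roughly as follows. This is a simplified illustration, not my exact code (which is on GitHub): the function name `wikitable_to_dataframe` is my own, and it assumes the header cells (`<th>`) appear only in the table's first row, as they do in the Wikipedia revenue table.

```python
import pandas as pd
from bs4 import BeautifulSoup

def wikitable_to_dataframe(html: str) -> pd.DataFrame:
    """Parse the first 'wikitable' on a page into a pandas DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    # Header cells are assumed to live only in the first row
    headers = [th.text.strip() for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr")[1:]:
        cells = [td.text.strip() for td in tr.find_all("td")]
        if cells:  # skip rows that have no data cells
            rows.append(cells)
    return pd.DataFrame(rows, columns=headers)

# Fetching the live page requires the `requests` library, e.g.:
# import requests
# url = "https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue"
# df = wikitable_to_dataframe(requests.get(url).text)
```

From there, Pandas takes over: renaming columns, stripping footnote markers, and converting revenue strings to numbers before the export.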
See my code on Github
Here is the final result on Python (Jupyter Notebook).
![](https://static.wixstatic.com/media/8e630b_5b04fe995c564ff0a87351fbfb80426b~mv2.png/v1/fill/w_980,h_518,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/8e630b_5b04fe995c564ff0a87351fbfb80426b~mv2.png)
After exporting this file from Jupyter Notebook to a .csv file, I obtained a clean, well-organized dataset that opens directly in Excel.
![](https://static.wixstatic.com/media/8e630b_cda3d06121fc433186b64e8a8ea5c75d~mv2.png/v1/fill/w_980,h_928,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/8e630b_cda3d06121fc433186b64e8a8ea5c75d~mv2.png)
You can freely download my file here: Top 100 largest companies in the US
To visualize the data, I used Tableau to create a map that displays the density of large corporations in the US. Hovering over a company's dot reveals detailed information about it.
![](https://static.wixstatic.com/media/8e630b_fbda5dfddb4d4ba7ac8cbf36ca068ba3~mv2.png/v1/fill/w_980,h_825,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/8e630b_fbda5dfddb4d4ba7ac8cbf36ca068ba3~mv2.png)
My dashboard is now live on Tableau Public.
In conclusion, this project demonstrates an end-to-end data analytics process. First, I scraped raw data from a website using Python's Beautiful Soup and cleaned it with Pandas. Next, I downloaded and organized the data in an Excel worksheet. Finally, I connected the data to Tableau and built interactive, informative visualizations.
To interact with my dashboard, view my visualizations on Tableau Public.
Additional information:
I learned this web scraping approach with Beautiful Soup (Python) from Alex The Analyst, a well-known data analytics educator.
Check out his tutorial here: Scraping Data from a Real Website | Web Scraping in Python - YouTube