Data scraping

To have accurate and reliable data for our prototype, we devised a workflow for scraping headlines from around the world. Although this data was scraped manually and is a smaller sample set, it still gave us an overview that seemed to verify our assumption: different countries do indeed talk about news elsewhere differently.

Target countries and VPN list

Since our refined concept investigates how countries present news from other countries differently, we derived a list of six target countries (United States, Brazil, China, Finland, Egypt and Australia) that every country on our master list of VPN countries would then search headlines about. We chose these target countries not only to give more equal representation across the world (each is on a separate continent), but also because they have made significant headlines during the coronavirus epidemic (our chosen topic). The only exception is Europe, where we chose Finland instead of Italy to get a more personal perspective.

Our master list of VPN countries comprised 26 countries that we had access to via our individual VPNs. While it was in the end rather Euro-centric (most of the VPNs available to us were European), we did manage to get access to a few countries from North America, Asia and Australia. These countries, along with their country codes, are listed in the screenshot below.

 

Workflow

To start the data scraping process, we would turn on a VPN, open an incognito Chrome window cleared of all previous search history, and then Google search “coronavirus” (our chosen topic) along with the target country. An example search would be: Canadian VPN on, Google search “coronavirus brazil.”

The above post shows what the search results for that example look like. We then chose the top five headlines, taking the first hit from each group. (In this case, the top two shown were “As coronavirus spreads globally, Brazil’s president visits a…” and “Brazilian official who met Trump at Mar-a-Lago tests positive…”) If a result did not pertain to the target country, that entry was skipped. In addition, to maintain consistency throughout the collection, we limited the time frame to 1-31 March 2020.
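The selection rules above can be sketched in Python. This is only an illustration, since the collection itself was done by hand. The `Hit` record, its field names and the `pick_headlines` function are hypothetical, and the sketch assumes that when the first hit in a group is skipped, the next qualifying hit in that group may be taken instead.

```python
from dataclasses import dataclass
from datetime import date

# Time frame stated in the workflow: 1-31 March 2020.
START, END = date(2020, 3, 1), date(2020, 3, 31)


@dataclass
class Hit:
    """One search result (hypothetical structure, filled in by hand)."""
    group: int           # position of the result group on the page
    headline: str
    published: date
    about_target: bool   # manual judgement: does it pertain to the target country?


def pick_headlines(hits, limit=5):
    """Return the first qualifying hit from each group, in page order, up to `limit`."""
    chosen = []
    seen_groups = set()
    for hit in hits:
        if hit.group in seen_groups:
            continue  # this group already contributed a headline
        if not hit.about_target:
            continue  # skip results that are not about the target country
        if not (START <= hit.published <= END):
            continue  # skip results outside the time frame
        seen_groups.add(hit.group)
        chosen.append(hit)
        if len(chosen) == limit:
            break
    return chosen
```

A page with fewer than five qualifying groups simply yields fewer headlines.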

The metadata was then copied into a plain text file in VS Code, making sure to retain important information such as the date and news outlet, along with a short sentence taken from either the sub-header of the headline or the first sentence of the article itself.
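As a rough illustration of what one copied entry could look like, the sketch below formats a single record as plain text. The field labels, their order and the `format_entry` helper are assumptions made for this example, not the exact layout of our file.

```python
from datetime import date


def format_entry(vpn_code, target, headline, outlet, published, summary):
    """Format one headline record as a plain text block (hypothetical layout)."""
    lines = [
        f"vpn: {vpn_code}",
        f"target: {target}",
        f"date: {published.isoformat()}",   # e.g. 2020-03-12
        f"outlet: {outlet}",
        f"headline: {headline}",
        f"summary: {summary}",              # sub-header or first sentence of the article
    ]
    # A blank line after each record keeps entries visually separate in the file.
    return "\n".join(lines) + "\n\n"
```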

 

International countries

This process was rather straightforward when using VPNs for countries where English is spoken natively. However, to avoid an English bias in the news sources for VPN countries where English is not the main language, we translated the search terms into the native language. The same search on the Mexican VPN, for example, would have been “coronavirus brasil,” as “Brasil” is the Spanish word for Brazil.
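The way a search string changes with the VPN country's language can be sketched as a small lookup. The `LOCAL_NAMES` table and `build_query` function are hypothetical; only the Spanish entry comes from the example above, and the sketch assumes the topic word itself stays the same, which will not hold for every language.

```python
TOPIC = "coronavirus"

# (language code, target country) -> local spelling of the country name.
# Only the Spanish "brasil" entry is taken from the text; others would be added by hand.
LOCAL_NAMES = {
    ("es", "Brazil"): "brasil",
}


def build_query(target, language="en"):
    """Build the search string for a target country in the given language."""
    if language == "en":
        name = target
    else:
        # Fail loudly rather than fall back to English, which would
        # reintroduce the English bias the translation is meant to avoid.
        name = LOCAL_NAMES[(language, target)]
    return f"{TOPIC} {name}".lower()
```

With this table, `build_query("Brazil")` gives the English search and `build_query("Brazil", "es")` gives the Spanish one used on the Mexican VPN.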

Of course, this meant that the search results would also be in the native language. To get around this, we used the automatic Google Translate extension within Chrome to toggle back and forth between English and the native language.

Although toggling between languages required a few extra steps for countries where English is not the native language, this vastly improved the quality of the data, as otherwise we would only see media from English-speaking countries. That would not necessarily reflect the news most people in the country would be seeing, so it was important to include this step in the data collection workflow.

Thereafter, it was just rinse and repeat for all the countries on the VPN master list. Each VPN country searched each of the six target countries before moving on to the next VPN on the master list. While the manual method may not have been the most efficient, it did allow us to review the data as we scraped it and improve our copy/paste skills at the same time.
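The order of work described above amounts to a nested loop: every VPN country runs all six target searches before the next VPN is switched on. The sketch below lays that plan out. The `search_plan` function is hypothetical, and `VPN_SAMPLE` holds just two assumed country codes for the Canadian and Mexican VPNs mentioned earlier rather than the full master list.

```python
TARGETS = ["United States", "Brazil", "China", "Finland", "Egypt", "Australia"]

# Two sample codes only (assumed ISO-style); the real master list has 26 entries.
VPN_SAMPLE = ["CA", "MX"]


def search_plan(vpn_codes, targets=TARGETS):
    """List (vpn, target) pairs so each VPN finishes all targets before the next."""
    return [(vpn, target) for vpn in vpn_codes for target in targets]


plan = search_plan(VPN_SAMPLE)
print(len(plan))  # 2 VPNs x 6 targets = 12 searches

# For the full list: 26 VPN countries x 6 targets, with up to 5 headlines each.
total_searches = 26 * len(TARGETS)
max_headlines = total_searches * 5
print(total_searches, max_headlines)
```

For the full master list this works out to 156 searches and at most 780 headlines, fewer wherever a search returned under five relevant results.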