Scrape Data without Selenium by Exposing Hidden APIs

Rendra Sukana · Published in Dev Genius · 8 min read · Aug 22, 2022


Photo by James Harrison on Unsplash

Recently, I have been working on a side project that requires product data from Tokopedia, one of the largest e-commerce websites in Indonesia. The main objective was to build a web scraper that can reliably obtain the product and seller data from the search results page, which could then be processed as a raw dataset or used to generate insights further downstream. We resort to web scraping because, like many other websites, Tokopedia does not provide a public API to query its data.

In this article, I assume that you have a basic understanding of Python programming, making basic API requests, and using HTTP methods.

A Faster & Easier Way

Even though Tokopedia does not publish any documented API, the website exposes a “hidden” API that we can utilize. There are several interesting things about using this API.

First, the API has an extremely fast response time. Bear in mind, this is the API used to serve products to users, so its content delivery has to be amazingly fast, and we can incorporate that speed into our web scraper. Also, by using this API we do not have to load JavaScript or images, render HTML, or deal with the other things we would have to handle with Selenium. This further cuts down the loading time.

Second, the program is easy to build. As you will see, the code is not bloated at all. Aside from Python itself, I mainly just used Requests and Pandas. The scraper itself, without the headers and query (which we’re going to copy from the browser anyway), is just a few lines of code. We do not even need to deal with user-agent spoofing, web drivers, or other components. We just send the API request, and the API happily sends the response. This decreases development time and makes the program easier to debug.

Sounds fun? Let’s get started!

1. Exposing The Hidden API

Let’s see the hidden API that I am talking about. First, we need our browser to expose the API. For some reason, Firefox failed to expose it, so I recommend Chrome in this instance.

Search Result Page of Tokopedia.com

Then, we will search for some products on www.tokopedia.com. I tried to find a “White Linen Shirt”, which happens to be my favorite type of shirt. Shortly after, Tokopedia will show a number of products on its search results page.

Right-click anywhere on the page and click “Inspect” to bring up the developer tools. Then, go to the “Network” tab. This tab monitors which requests are sent to which endpoints, so if there is any hidden API, it is going to be revealed here. Then, click “XHR”. If you’re curious what XHR is, it’s short for XMLHttpRequest, a JavaScript object used to transfer data. Essentially, by clicking “XHR”, we are separating the requests that fetch data from the requests that fetch images, HTML, CSS, or JavaScript. Finally, reload the page to monitor the requests.

XHR objects under the “Network” Tab of Developer Tools. Did you see anything interesting?

Clicking “XHR” reveals all the requests used to transfer data. By clicking on each request name, we can view the request and response headers, the request payload, and the site’s response body. A tip: evaluate the API names and see if anything indicates a product search query, then view its response body to verify it.

Clicking on the “SearchProductQueryV4” and checking the response.

If you look closely enough, there is a request named “SearchProductQueryV4”. The name is quite interesting. Indeed, previewing the response body shows that the API returns the product and seller data, which is exactly the information we are interested in.

Right-click on the request and then click “Copy as cURL”. This copies the endpoint, headers, and request body and formats them as a cURL command. You can also copy the request in fetch or PowerShell format; the endpoint, headers, and request body will still be copied either way.

Copying the API as cURL. Other formats also available.

2. Testing The API with Requests Library

Next, we need to clean up the copied request a bit. You can use tools like Postman or Insomnia for this part, but separating it manually won’t hurt. We essentially just need to separate the endpoint, the headers, and the query. If we paste the copied request into a text editor, we get something like this:

The request still in cURL format.

Let us break this request into several parts:

  1. The first line, the part after the curl command, is the API endpoint. An endpoint is simply the address you send your requests to. It is like a regular website address, but for sending API requests.
  2. The lines preceded by the -H option are the headers of the request. This section contains several interesting bits of information, including the infamous user-agent header.
  3. Finally, we arrive at the request body, which is prefixed by --data-raw. This is where the query is located, and it is where you specify exactly what you want from the API.

Let us take a few moments to separate these parts into variables.

Separating each part into a different variable.
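Since the exact values come from your own browser, here is a minimal sketch of what that separation might look like. The endpoint URL, header values, and query string below are placeholders standing in for whatever “Copy as cURL” gave you, not the literal values from Tokopedia:

```python
# A minimal sketch; replace every value with the one copied from your browser.
# The endpoint path shown here is an assumption based on the request name.
endpoint = "https://gql.tokopedia.com/graphql/SearchProductQueryV4"

# Headers copied from the -H options of the cURL command (placeholders shown).
headers = {
    "User-Agent": "Mozilla/5.0 (copied from the browser)",
    "Content-Type": "application/json",
    "Referer": "https://www.tokopedia.com/search?q=white+linen+shirt",
}

# The --data-raw part of the cURL command: the GraphQL request body.
# The real query is much longer; this is a truncated placeholder.
query = '[{"operationName": "SearchProductQueryV4", "variables": {}, "query": ""}]'
```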

Next, we need to analyze which part of the query does what. I discovered that there are three pieces of information that are particularly important: page, q and start. q accepts the search terms, while page together with start controls what information is going to be loaded. For example, if we are loading the second page of the search result, we should supply the query with page=2 and start=60. start should be 60 because products 0–59 were loaded on the first page (yes, Tokopedia loads 60 products per page).

We can use an f-string to vary the values of these query arguments. This way, we can manipulate page, q and start. However, since this query contains curly braces, we need to escape them by doubling them: replace every brace with a double brace.

Using an f-string to vary the query arguments. Other braces (not shown) are escaped by doubling them.
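As a rough sketch, assuming the variables above, the query can be wrapped in a small helper. The variable structure shown is a heavily truncated placeholder; only the doubled braces and the three substituted arguments matter here:

```python
def build_query(search_term: str, page: int) -> str:
    """Build the request body for a given search term and page number."""
    # Tokopedia loads 60 products per page, so start is the offset of the
    # first product on the requested page.
    start = (page - 1) * 60
    # Every literal brace from the copied query is doubled ({{ and }}) so the
    # f-string only substitutes q, page and start. Truncated placeholder body.
    return (
        f'[{{"operationName": "SearchProductQueryV4", '
        f'"variables": {{"params": "q={search_term}&page={page}&start={start}"}}}}]'
    )
```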

Then, we use the Python Requests library to fire off a test request. Aside from supplying the endpoint, do not forget to also supply the headers and the request body. Fire off the request and see if the server delivers the response we expect.

Using the json library to print the response body. Part of the product data is shown.
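A minimal sketch of that test request, assuming the endpoint, headers, and build_query() helper defined above:

```python
import json

import requests

# Fire a test request for the first page of "white linen shirt" results.
response = requests.post(endpoint, headers=headers, data=build_query("white linen shirt", page=1))
response.raise_for_status()

# The body is JSON; pretty-print part of it to verify the product data is there.
response_body = response.json()
print(json.dumps(response_body, indent=2)[:2000])
```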

As we can see, the server includes a JSON document in its response body. Fortunately, it includes the product data that we are looking for. Now, some of the values are in Indonesian, because Tokopedia currently operates in the Indonesian market.

Now, not every single thing in the headers or query body is important. Aside from page, q and start, I chose to leave the headers and query body as-is. Of course, you can tweak them to check whether leaving certain things out affects your code.

3. Parsing The Response

After testing the request, we need to parse the response for the parts we are interested in. We can extract the response body by calling response.json(). Usually the method returns a dictionary, but in this case it returns a list instead. Let’s put the result inside a variable called response_body.

The response body returns a list in this instance.

Inside response_body, all we need to do is traverse the structure to see where the product data resides. Then, it’s just a matter of chaining the indexes and keys together. In this Tokopedia instance, the product data can be found at the following chain of indexes:

response_body[0]['data']['ace_search_product_v4']['data']['products']

which returns a list of dictionaries. You can put this directly into a Pandas DataFrame by calling pd.DataFrame.from_records() like so:

Printing the result as a DataFrame object to the terminal.
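A short sketch of that step, reusing response_body from the test request above:

```python
import pandas as pd

# The products sit a few levels deep inside the JSON response.
products = response_body[0]['data']['ace_search_product_v4']['data']['products']

# Each product is a dictionary, so the list maps straight onto a DataFrame.
df = pd.DataFrame.from_records(products)
print(df.head())
```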

Once you have the result, you can export it as a CSV or load it into your database.

4. Completing The Function

Up to this point, the query returns the product data for only a single page. To obtain all the product data for a given search, we just need a function that retrieves the data for all the pages, like so:
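A simplified sketch, assuming the helpers defined earlier; the totalData field name is my assumption about where the total product count lives in the response:

```python
import math

import pandas as pd
import requests

def check(search_term: str) -> tuple[int, int]:
    """Query the first page to find how many products and pages the search returns."""
    response = requests.post(endpoint, headers=headers, data=build_query(search_term, page=1))
    data = response.json()[0]['data']['ace_search_product_v4']
    srp_count = data['header']['totalData']  # assumed field name for the total product count
    page_count = math.ceil(srp_count / 60)   # Tokopedia serves 60 products per page
    return srp_count, page_count

def scrape(search_term: str) -> pd.DataFrame:
    """Fetch every page of the search result and concatenate the products."""
    srp_count, page_count = check(search_term)
    frames = []
    for page in range(1, page_count + 1):
        response = requests.post(endpoint, headers=headers, data=build_query(search_term, page=page))
        products = response.json()[0]['data']['ace_search_product_v4']['data']['products']
        frames.append(pd.DataFrame.from_records(products))
    # Trim any overshoot on the last page down to the reported total.
    return pd.concat(frames, ignore_index=True).head(srp_count)
```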

The check() function returns the total number of products retrieved from the search as srp_count and the total number of pages as page_count. In a sense, it “checks” the search result page to see how many products the result contains, and across how many pages; hence the name. Finally, the scrape() function serves as the main function that retrieves the data from all the search result pages instead of a single page only. scrape() requires both srp_count and page_count as arguments for the query, which is where check() comes in handy.

Now, there are several features that we could add. Perhaps we could include a price range limit, or include only sellers with a certain membership status. We can add these features by modifying the query and supplying them as arguments to the functions. Or, we can add multi-threading, like I did here.
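For example, a sketch of how the per-page requests could be parallelized with concurrent.futures; the names fetch_page and scrape_threaded are illustrative, and my actual multi-threaded version may differ:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import requests

def fetch_page(search_term: str, page: int) -> pd.DataFrame:
    """Fetch a single search result page and return its products as a DataFrame."""
    response = requests.post(endpoint, headers=headers, data=build_query(search_term, page=page))
    products = response.json()[0]['data']['ace_search_product_v4']['data']['products']
    return pd.DataFrame.from_records(products)

def scrape_threaded(search_term: str, max_workers: int = 8) -> pd.DataFrame:
    """Fetch all pages concurrently instead of one by one."""
    _, page_count = check(search_term)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frames = list(pool.map(lambda page: fetch_page(search_term, page), range(1, page_count + 1)))
    return pd.concat(frames, ignore_index=True)
```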

As usual, the whole, functional code is available on my GitHub. The final length of the code is barely over 50 lines, but that’s mostly because of the lengthy headers. In actuality, building this scraper is barely more complicated than completing a single HTTP request.

Summary

By thinking beyond HTML, we can build a web scraper that is faster, more efficient, far less verbose, and sends far fewer requests to the server. Always look for a hidden API when doing web scraping, because when a scraper works faster with fewer requests, both you and the site owner can be happy.

I hope this story helps you in your journey. Feel free to comment if you find any mistakes, and clap if you like it. Thank you.
