Serverless Web Scraping with Google Firebase and Cloud Functions

vinay mavi
2 min read · Oct 8, 2018

Web scraping is a good way to make government information more meaningful and interactive, so that we as a society can engage with it and have it pushed to us automatically when needed.

Every government holds a great deal of information, usually published in a plain, static format. This is particularly true of the Indian government, where digitalization is evolving slowly.

India has 17 to 18 percent of the world's population, most of it living in villages, and the volume of data about these villages is huge. Presenting all of it in an interactive and meaningful way is a challenge for the government, and an even bigger one for individuals and non-profit organizations that also want to add a push feature. The challenge grows further when the pages are AJAX driven.

I faced exactly this kind of challenge: I needed to scrape a section of an Indian government website, http://planningonline.gov.in/ReportData.do?ReportMethod=getAnnualPlanReport. This website holds the activity plan reports of all Indian villages for 5 financial years. The requirements:

Incremental scraping of 1.5 million pages
Support for AJAX-driven pages

I am doing this as an individual, so I chose a serverless approach: serverless platforms come with a significant free quota, and they remove server hassles, which helps me focus on the work and stay productive.

Google Cloud Functions + Google Chrome Puppeteer + Google Firebase looked like the best technology stack for this challenge, thanks to Google bringing the latest Node.js LTS support to Cloud Functions.
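Puppeteer suits the AJAX requirement because it drives a real headless Chrome, so the page's scripts run before the HTML is read. The following is a minimal sketch, not the code from the repository: the function name fetchRenderedHtml is made up for illustration, and it assumes the caller passes in a Puppeteer browser (for example from puppeteer.launch, typically with the --no-sandbox argument on Cloud Functions).

```javascript
// Sketch: open a page in headless Chrome and return its HTML once
// AJAX traffic has settled. `browser` is assumed to be a Puppeteer
// Browser instance created by the caller, e.g.
//   const browser = await require("puppeteer").launch({ args: ["--no-sandbox"] });
async function fetchRenderedHtml(browser, url) {
  const page = await browser.newPage();
  try {
    // "networkidle0" waits until there have been no open network
    // connections for at least 500 ms, which covers most AJAX loads.
    await page.goto(url, { waitUntil: "networkidle0" });
    // Serialized DOM after the page's scripts have run.
    return await page.content();
  } finally {
    // Always release the tab, even if navigation fails.
    await page.close();
  }
}
```

Passing the browser in, rather than launching it inside the function, lets one Chrome instance be reused across several pages within a single function invocation.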

Install Node.js.

Install the Google Firebase CLI tool:

npm install -g firebase-tools

Run firebase login to log in via the browser and authenticate the Firebase tool.

Clone the project from GitHub:

https://github.com/vinaymavi/graminbharat.git

and run npm install.

The file functions/index.js creates an event-driven Google Cloud Function that runs whenever a new document is created in the row1 collection. The function inserts new documents into the same collection, so each run triggers further runs. This cycle works like a loop and continues until the last record has been fetched.
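The self-feeding loop can be sketched roughly as follows. This is an illustration under assumptions, not the repository's actual code: the stage names other than PLAN_YEAR, the helper buildChildDocuments, and the scrapeLinks function mentioned in the comments are all hypothetical. The pure function decides which documents to insert next; the commented part shows how it might be wired to a Firestore onCreate trigger.

```javascript
// Hypothetical stage order: only PLAN_YEAR appears in the article,
// the other names are placeholders for illustration.
const STAGES = ["PLAN_YEAR", "STATE", "VILLAGE"];

// Given the document that triggered the function and the links scraped
// from its page, build the documents to insert into the same collection.
function buildChildDocuments(doc, links) {
  const index = STAGES.indexOf(doc.stage);
  // Unknown stage or last stage: insert nothing, so the loop ends here.
  if (index === -1 || index === STAGES.length - 1) return [];
  const nextStage = STAGES[index + 1];
  return links.map((url) => ({ url, stage: nextStage }));
}

// Wiring sketch (commented out so this block runs without dependencies):
// const functions = require("firebase-functions");
// const admin = require("firebase-admin");
// admin.initializeApp();
// exports.crawl = functions.firestore
//   .document("row1/{docId}")
//   .onCreate(async (snap) => {
//     const doc = snap.data();
//     const links = await scrapeLinks(doc.url); // hypothetical scraper
//     const children = buildChildDocuments(doc, links);
//     const row1 = admin.firestore().collection("row1");
//     await Promise.all(children.map((child) => row1.add(child)));
//   });
```

Each inserted child fires the trigger again, so the loop stops only when a run inserts nothing, which here happens at the last stage.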

Run firebase deploy --only functions to deploy the function to Google Cloud.

Insert a document into the row1 collection:

 {"url": "URL", "stage": "PLAN_YEAR"}

This insertion starts the execution loop.
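The seed document can be added from the Firebase console, or from a small script. A minimal sketch of the script route, assuming a Firestore instance from the firebase-admin package is passed in; the function name seedCrawl is made up for illustration, and "URL" stands for the start page as in the document above.

```javascript
// Sketch: insert the first document so the onCreate trigger fires once.
// `db` is assumed to be a Firestore instance, e.g.
//   const admin = require("firebase-admin");
//   admin.initializeApp();
//   const db = admin.firestore();
function seedCrawl(db, startUrl) {
  // add() creates a document with an auto-generated ID and returns a Promise.
  return db.collection("row1").add({ url: startUrl, stage: "PLAN_YEAR" });
}

// Usage (requires firebase-admin and credentials, so left commented out):
// seedCrawl(db, "URL").then((ref) => console.log("seeded", ref.id));
```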

CAUTION: This kind of execution loop can run indefinitely. Check your code carefully before deploying.
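One way to reduce the risk of a runaway loop is to give every document a depth counter and a deterministic ID derived from its URL. The sketch below is an assumption-laden illustration, not what the repository does: MAX_DEPTH, docIdFor, withGuards, and the depth field are all hypothetical names.

```javascript
const crypto = require("crypto");

// Hypothetical cap on how many hops from the seed document are allowed.
const MAX_DEPTH = 5;

// Same URL always maps to the same document ID.
function docIdFor(url) {
  return crypto.createHash("sha256").update(url).digest("hex");
}

// Attach a depth counter and a deterministic ID to each child document.
// Returns an empty list once the depth cap is exceeded, which ends the loop.
function withGuards(parent, children) {
  const depth = (parent.depth || 0) + 1;
  if (depth > MAX_DEPTH) return [];
  return children.map((child) => ({
    id: docIdFor(child.url),
    data: { ...child, depth },
  }));
}

// Writing sketch (commented out; assumes a firebase-admin Firestore `db`):
// for (const { id, data } of withGuards(parentDoc, children)) {
//   // create() rejects if the document already exists, so a URL that was
//   // already queued does not fire the trigger a second time.
//   await db.collection("row1").doc(id).create(data).catch(() => {});
// }
```

The depth cap bounds the loop even if the scraping logic has a bug, and the URL-derived ID keeps pages that link to each other from re-queuing one another forever.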

References:
https://firebase.google.com/docs/functions/get-started
https://github.com/GoogleChrome/puppeteer
https://firebase.google.com/docs/cli/
https://github.com/vinaymavi/graminbharat
https://medium.com/@ebidel/puppeteering-in-firebase-google-cloud-functions-76145c7662bd


Vinay is a cloud architect, T-shaped developer, blogger, open source contributor, and believer in reverse mentorship.