Web Crawler in Java
In this article, you will learn what a web crawler in Java is, what its functions are, and where it can be implemented.
Web Crawler Definition
A web crawler is an application used mainly for web navigation and page discovery, so that new or newly created pages can be found and indexed. Starting from a set of seed websites or well-known URLs, the crawler explores the web broadly and deeply, extracting hyperlinks as it goes.
The web crawler is one of the most important applications of the Breadth-First Search (BFS) algorithm: the entire internet can be modeled as a directed graph, where pages are the vertices and hyperlinks are the directed edges.
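To make the graph analogy concrete, the following is a minimal sketch of BFS over a hand-built directed graph of pages. The page names and the adjacency map are illustrative assumptions, not something a crawler would have up front; a real crawler discovers the edges only by fetching and parsing each page.

import java.util.*;

public class GraphBfsDemo {
    public static void main(String[] args) {
        // Hypothetical link structure: page -> pages it links to
        Map<String, List<String>> links = Map.of(
                "A", List.of("B", "C"),
                "B", List.of("C", "D"),
                "C", List.of(),
                "D", List.of("A"));

        Queue<String> frontier = new LinkedList<>();
        Set<String> visited = new HashSet<>();

        frontier.add("A");   // the seed page
        visited.add("A");

        while (!frontier.isEmpty()) {
            String page = frontier.remove();
            System.out.println("visiting: " + page);
            for (String next : links.get(page)) {
                // add() returns false if the page was already seen
                if (visited.add(next)) {
                    frontier.add(next);
                }
            }
        }
    }
}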
A web crawler needs to be robust and polite. Politeness means abiding by a site's robots.txt directives and not visiting the same website too frequently. Robustness means coping with malicious content and avoiding spider traps, page structures that would keep the crawler looping forever.
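As an illustration of politeness, the sketch below fetches a site's robots.txt and applies a deliberately simplified check: it only honours Disallow: lines in the User-agent: * group and ignores the rest of the robots.txt grammar (wildcards, Allow:, crawl-delay). A production crawler should use a full robots.txt parser; this is only a sketch under those assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

class RobotsCheck {
    // Returns the Disallow path prefixes listed for "User-agent: *".
    // Simplified: no wildcard, Allow, or Crawl-delay support.
    static List<String> disallowedPaths(String host) {
        List<String> disallowed = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new URL("https://" + host + "/robots.txt").openStream()))) {
            boolean inStarGroup = false;
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    inStarGroup = line.substring(11).trim().equals("*");
                } else if (inStarGroup
                        && line.toLowerCase().startsWith("disallow:")) {
                    disallowed.add(line.substring(9).trim());
                }
            }
        } catch (Exception ex) {
            // No readable robots.txt: treat nothing as disallowed.
        }
        return disallowed;
    }

    // A URL path is allowed if it matches no disallowed prefix.
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}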
The following is the procedure a web crawler typically follows:
- Pick a URL from the frontier (the queue of URLs waiting to be crawled).
- Fetch the HTML code at that URL.
- Parse the HTML code to obtain the links to other URLs.
- Check whether the URL has already been crawled, and whether the same content has already been seen (a content-fingerprinting sketch follows this list). If neither is the case, add the page to the index.
- Check whether each extracted URL consents to being crawled (robots.txt rules, crawling frequency).
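The full program later in this article deduplicates only URLs, not content. As a minimal sketch of the content check, one common technique is to hash each page body and skip pages whose hash has been seen before; the SHA-256 choice and the helper names here are illustrative assumptions, not part of the program below.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

class ContentDedup {
    private final Set<String> seenHashes = new HashSet<>();

    // Returns true only the first time a given page body is seen.
    boolean isNewContent(String html) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(html.getBytes(StandardCharsets.UTF_8));
        // Encode the digest as a hex string for set membership
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return seenHashes.add(hex.toString());
    }
}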
Note: This code will not work on an online IDE due to proxy issues. Run it on your local machine instead.
Distinctions between data crawling and data scraping
Data crawling and data scraping are both important ideas in data processing, but they are distinct. Data crawling entails working with large data sets and building a custom crawler that can reach even the most deeply buried web pages. Data scraping means extracting data from any source, not necessarily the web (a minimal scraping sketch follows the table below).
| Data Crawling | Data Scraping |
| --- | --- |
| Data is extracted solely from the web. | Data is extracted from any source, including the web. |
| Deduplication is a fundamental component of data crawling. | Deduplication is not necessarily a part of data scraping. |
| It is mostly carried out at a large scale. | It can be done at any scale, small or large. |
| Only a crawl agent is needed. | Both a crawl agent and a parser are necessary. |
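For contrast with the crawler, here is a minimal scraping sketch: it fetches a single page and pulls out just the title text. The example URL is a placeholder, and the regex-on-HTML approach is a simplification for illustration; real scrapers normally use an HTML parser.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleScraper {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://www.example.com");
        StringBuilder html = new StringBuilder();
        // Read the whole page into one string
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                html.append(line);
            }
        }
        // Pull out the contents of the <title> element
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE).matcher(html);
        if (m.find()) {
            System.out.println("Page title: " + m.group(1));
        }
    }
}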
Let us understand the concept with an example program:
File name: Webcrawl.java
// Example program for a Java web crawler
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// This class holds the methods needed by the web crawler
class WebCrawler {
    // To keep the links in the FIFO order required by BFS
    private Queue<String> queue;
    // To save visited links
    private HashSet<String> discoveredWebsites;

    // Constructor for setting up the necessary variables
    public WebCrawler() {
        this.queue = new LinkedList<>();
        this.discoveredWebsites = new HashSet<>();
    }

    // A method to launch the BFS and find all URLs
    public void discover(String root) {
        // Seed the BFS with the base URL
        this.queue.add(root);
        this.discoveredWebsites.add(root);
        System.out.println("webpage accessed: " + root);

        // Loop until the queue is empty
        while (!queue.isEmpty()) {
            // The URL currently at the head of the queue
            String v = queue.remove();
            // The website's raw HTML
            String raw = readUrl(v);

            // Regular expression matching absolute https:// URLs
            String regex = "https://(\\w+\\.)*(\\w+)";
            // The URL pattern compiled from the regex
            Pattern pattern = Pattern.compile(regex);
            // To extract every URL in raw that matches the pattern
            Matcher matcher = pattern.matcher(raw);

            // Loop until all the URLs on the current
            // website have been stored in the queue
            while (matcher.find()) {
                // The next URL found in raw
                String actual = matcher.group();
                // Check whether this URL has been visited already
                if (!discoveredWebsites.contains(actual)) {
                    // If not visited, mark it as visited,
                    // print it, and add it to the queue
                    discoveredWebsites.add(actual);
                    System.out.println("webpage accessed: " + actual);
                    queue.add(actual);
                }
            }
        }
    }

    // Method to return the raw HTML of the current website
    public String readUrl(String v) {
        // Collects the HTML line by line
        StringBuilder raw = new StringBuilder();
        // If this code throws any exceptions, handle them in a try-catch block
        try {
            // Turn the string into a URL
            URL url = new URL(v);
            // Reader over the website's HTML
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(url.openStream()));
            String input;
            // Read the HTML line by line and append it to raw
            while ((input = br.readLine()) != null) {
                raw.append(input);
            }
            br.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return raw.toString();
    }
}

public class Webcrawl {
    public static void main(String[] args) {
        // Create the crawler object
        WebCrawler webCrawler = new WebCrawler();
        String root = "https://www.google.com";
        webCrawler.discover(root);
    }
}
Output
webpage accessed: https://www.google.com
webpage accessed: https://www.facebook.com
webpage accessed: https://www.amazon.com
webpage accessed: https://www.microsoft.com
webpage accessed: https://www.apple.com
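A note on the design: extracting links with a regular expression keeps the example dependency-free, but it is fragile, since it misses relative links and matches URLs outside anchor tags. As an alternative sketch, an HTML parser such as jsoup can extract the href of every anchor; this assumes the jsoup library is on the classpath and is not part of the program above.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLinkExtractor {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page with jsoup
        Document doc = Jsoup.connect("https://www.google.com").get();
        // Select every anchor element that has an href attribute
        for (Element link : doc.select("a[href]")) {
            // abs:href resolves relative links against the page URL
            System.out.println("link found: " + link.attr("abs:href"));
        }
    }
}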