LLM Based Web Crawler
It has been a while.
This is Tsubasa Kato of Inspire Search Corp.
Summer is almost over.
At Inspire Search, we are currently developing a web crawler that uses a large language model (LLM). We send words and sentences to the LLM through an API and get new concepts back.
Our development environment uses a GeForce RTX 3090 to run the LLM. (We bought it in Akihabara.)
Here, a "new concept" means words and sentences related to the input. Until now, we had to look these up in a database built by crawling websites, but now the knowledge stored in the LLM's parameters does that job.
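As a rough illustration of this expansion step, here is a minimal sketch in Python. It is not the code from our repository. The names `build_prompt`, `parse_terms`, and `expand_concepts` are made up for this example, the prompt wording is an assumption, and `fake_llm` is a stand-in that returns a canned reply so the sketch runs without a model. In practice you would pass in a function that calls your own locally hosted LLM API.

```python
import re
from typing import Callable, List

# Matches list markers such as "- ", "* " or "1. " that models often prepend.
_LIST_MARKER = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")


def build_prompt(term: str, max_terms: int) -> str:
    """Build a prompt asking for related words or phrases, one per line."""
    return (
        f"List up to {max_terms} words or short phrases closely related to "
        f"'{term}'. Put one item on each line and add no explanations."
    )


def parse_terms(reply: str, max_terms: int) -> List[str]:
    """Turn a line-per-item reply into a clean, de-duplicated list."""
    terms: List[str] = []
    seen = set()
    for line in reply.splitlines():
        cleaned = _LIST_MARKER.sub("", line).strip()
        key = cleaned.lower()  # compare case-insensitively
        if cleaned and key not in seen:
            seen.add(key)
            terms.append(cleaned)
    return terms[:max_terms]


def expand_concepts(
    term: str, llm: Callable[[str], str], max_terms: int = 5
) -> List[str]:
    """Expand one term into related terms using any prompt -> text callable."""
    return parse_terms(llm(build_prompt(term, max_terms)), max_terms)


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a local LLM call; returns a canned reply."""
    return "1. web crawler\n2. search engine\n- indexing\n* Web Crawler\n\nseed URL"


print(expand_concepts("web crawling", fake_llm))
# → ['web crawler', 'search engine', 'indexing', 'seed URL']
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM server or client library.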
The advantage of this approach is that it avoids both the cost of going to the web to retrieve information and the security or confidentiality issues that come with doing so. (Data expansion / enhancement stays local.)
For example, if a research and development company wants to crawl the Web for a lot of data on a certain topic, they no longer need to send their queries directly to search engine company A or B before crawling the Web. There is surely some information that they do not want other companies to know, for competitive reasons.
It’s like first expanding the concept of the target you want to crawl, and then throwing a big rope to the web crawler.
Even if the crawler needs to contact a third-party search engine to get a seed URL (the URL where crawling starts), the concept (search term) it sends is abstracted.
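One possible way to read "abstracted" is that only the expanded terms are sent to the outside search engine, never the original topic. The sketch below shows that idea under several assumptions: `seed_query_urls` is a hypothetical helper, `https://search.example.com/search` is a placeholder endpoint and not a real service, the `q` parameter name is assumed, and the sample terms are made-up illustration data.

```python
from typing import List
from urllib.parse import urlencode

# Hypothetical placeholder; replace with the search engine you actually use.
SEARCH_ENDPOINT = "https://search.example.com/search"


def seed_query_urls(
    original_term: str,
    expanded_terms: List[str],
    endpoint: str = SEARCH_ENDPOINT,
) -> List[str]:
    """Build search URLs from expanded terms, skipping any term that
    still contains the original topic verbatim."""
    original = original_term.lower()
    urls = []
    for term in expanded_terms:
        if original in term.lower():
            continue  # would expose the original topic, so do not send it
        urls.append(f"{endpoint}?{urlencode({'q': term})}")
    return urls


urls = seed_query_urls(
    "solid-state battery",
    ["lithium metal anode", "solid-state battery cost", "sulfide electrolyte"],
)
for url in urls:
    print(url)
# → https://search.example.com/search?q=lithium+metal+anode
# → https://search.example.com/search?q=sulfide+electrolyte
```

The result pages for these queries would then supply the seed URLs for the crawler. A simple substring check like this is only a first filter; a set of related terms can still hint at the original topic.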
The code is still unfinished, but it is on GitHub now. Please take a look if you like.
https://github.com/stingraze/llm-seed-url-generator
Tsubasa Kato 8/21/2023
Inspire Search Inc.
CEO
https://www.inspiresearch.io/en
I translated most of this article from my own Japanese article with the help of DeepL.
Update: 9/30/2023
I made a demo video of the LLM web crawler seed expander. You can watch it here:
https://stingraze.medium.com/llm-llama-2-web-crawler-seed-expander-33c0af32b648