A DIY search engine that would be somewhat usable should work like this. I'm basing this on a model of Google I built a few years ago.
Physical
Firstly, you need some servers. The number will depend on how much content you plan to index.
You'll need an internet connection
You should have a firewall
You should have a web application firewall to protect your data (for example, KEMP's LoadMaster-integrated Web Application Firewall).
You'll need a load balancer for your application servers - both the web publisher and possibly your database servers (see: Load Balancers – bringing zen to your holistic SEO strategy).
Applications - the search web application
Note: most people think that search and ranking happen when you do a search. It's actually already done. To make search so fast, the search layer just returns the next 10 results from a list starting at position 1, and the only logic applied is (a) geo-specific pages are removed, (b) QDF (query deserves freshness) checks pull in items based on a date and time range, and (c) real-time data is pulled from a queue which has it ready to go.
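As a rough sketch of that serving step - assuming the ranked list for each query has already been computed and stored somewhere - the query-time logic could be as small as this (all names and structures here are illustrative assumptions, not a real implementation):

```python
from datetime import datetime, timedelta

# Hypothetical precomputed data: query -> ranked list of (doc_id, geo, published)
# tuples, plus a queue of realtime items. All names here are illustrative.
PRECOMPUTED_RANKINGS = {}   # e.g. {"iphone cost": [("doc42", "US", datetime(...)), ...]}
REALTIME_QUEUE = {}         # e.g. {"iphone cost": ["news-doc-7", ...]}

def serve_results(query, start=0, count=10, user_geo="US", qdf_window_days=None):
    """Return the next `count` results from an already-ranked list.

    No ranking happens here: we only (a) drop pages pinned to another geo,
    (b) apply a query-deserves-freshness date window if requested, and
    (c) splice in any realtime items waiting in the queue.
    """
    ranked = PRECOMPUTED_RANKINGS.get(query, [])

    # (a) geo-specific pages for other regions are removed
    ranked = [r for r in ranked if r[1] in (None, user_geo)]

    # (b) QDF check: keep only items inside the date/time range
    if qdf_window_days is not None:
        cutoff = datetime.utcnow() - timedelta(days=qdf_window_days)
        ranked = [r for r in ranked if r[2] >= cutoff]

    # (c) realtime data is pulled from a queue that already has it ready to go
    realtime_ids = REALTIME_QUEUE.get(query, [])
    page = [doc_id for doc_id, _, _ in ranked]
    return (realtime_ids + page)[start:start + count]
```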
Essentially you want to separate the HTML user interface from the business logic and the databases (the index).
I would imagine a 3-tier setup where you have a very basic Apache (or IIS) web application that lets a user enter a search query and then fires it at the search application, which sends the results to a second user interface. That way, should there be peak demand, you can basically disconnect the user-facing servers from the functioning search servers.
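As a sketch of that separation, the front-end tier below does nothing except accept a query and hand it to a separate search service over HTTP; the hostname, ports, and the use of Python's standard-library server are assumptions for illustration, not the actual stack:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs, quote
from urllib.request import urlopen

# Assumed address of the separate search tier; in a real deployment this would
# sit behind the load balancer rather than being a single host.
SEARCH_TIER_URL = "http://search-tier.internal:8081/search?q="

class FrontEndHandler(BaseHTTPRequestHandler):
    """Tier 1: a very thin UI layer. It never touches the index itself."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query).get("q", [""])[0]
        if not query:
            body = b"<form><input name='q'><button>Search</button></form>"
        else:
            # Hand the query to the search tier and relay whatever it returns.
            with urlopen(SEARCH_TIER_URL + quote(query)) as resp:
                body = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), FrontEndHandler).serve_forever()
```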
The business logic layer - the "search" - should simply have two functions.
Function A - strip the search string in a query of non-value words (a, the, an, and, it, is), where each remaining word is searched, plus second queries where it's searched at different percentages of completeness. E.g. a search for "how much does an iphone cost" would become [much or does or iphone or cost] and also [much or does or iphone] and [much or does or cost].
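A rough sketch of what Function A might look like; the exact stop-word list and the way partial queries are generated (dropping one keyword at a time) are my assumptions about the "percentages of completeness" idea - "how" is included only so the example matches the one above:

```python
from itertools import combinations

# Non-value words to strip; "how" is added here only so the example below
# matches the one in the text - the real list is an open question.
STOP_WORDS = {"a", "the", "an", "and", "it", "is", "how"}

def build_queries(search_string):
    """Strip non-value words, then emit the full OR-query plus partial
    OR-queries at a lower 'completeness' (here: drop one keyword at a time)."""
    keywords = [w for w in search_string.lower().split() if w not in STOP_WORDS]
    queries = [" or ".join(keywords)]
    if len(keywords) > 2:
        for subset in combinations(keywords, len(keywords) - 1):
            queries.append(" or ".join(subset))
    return queries

print(build_queries("how much does an iphone cost"))
# ['much or does or iphone or cost', 'much or does or iphone',
#  'much or does or cost', 'much or iphone or cost', 'does or iphone or cost']
```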
The query is then handed to the next available search parser. The search parser connects to the next available database server. This is where load balancing comes in - if any single server becomes non-responsive, the load balancer just removes it from the queue and connects to the next available healthy server.
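A toy version of that failover behaviour might look like the following, assuming each backend server exposes some kind of health check; a real load balancer (hardware or software) would do this at the network level rather than in application code:

```python
from itertools import cycle

class SimpleLoadBalancer:
    """Round-robin over backend servers, skipping any that fail a health check."""

    def __init__(self, backends):
        self.backends = backends          # e.g. a list of database server objects
        self._rotation = cycle(backends)

    def next_available(self):
        # Try each backend at most once per call; unhealthy ones are skipped,
        # which is the "remove from the queue" behaviour described above.
        for _ in range(len(self.backends)):
            server = next(self._rotation)
            if server.is_healthy():       # assumed health-check hook
                return server
        raise RuntimeError("no healthy backend servers available")
```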
The parser requests the top 10 results with those words and is given a list of key IDs, each of which matches a URL.
The magic of rank order isn't established
during search - search is the final leg
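One way to picture that lookup, assuming the index stores document IDs in precomputed rank order for each keyword and a separate table maps IDs back to URLs (both structures are assumptions for illustration):

```python
# Assumed structures: an inverted index mapping each keyword to doc IDs stored
# in precomputed rank order, and a table mapping doc IDs back to URLs.
INVERTED_INDEX = {
    "iphone": [101, 205, 317],
    "cost":   [205, 101, 999],
}
URL_TABLE = {101: "https://example.com/iphone-prices",
             205: "https://example.org/cost-guide",
             317: "https://example.net/iphone-review",
             999: "https://example.com/pricing"}

def top_ids(or_query, limit=10):
    """Return up to `limit` doc IDs matching any keyword, walking the
    precomputed per-keyword lists in stored order (no re-ranking here)."""
    seen, results = set(), []
    keywords = [w.strip() for w in or_query.split(" or ")]
    for doc_id in (d for word in keywords for d in INVERTED_INDEX.get(word, [])):
        if doc_id not in seen:
            seen.add(doc_id)
            results.append(doc_id)
        if len(results) == limit:
            break
    return results

print([URL_TABLE[i] for i in top_ids("iphone or cost")])
```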
Applications - Crawl List
You need a URL list, which you would need to mark with last-crawled dates, time to crawl/page speed, and robots and meta-index allowed status.
You would also want to store a rank, a geo-location, and a language identifier.
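Sketched as a record, a crawl-list entry might carry fields along these lines; the field names and types are my assumptions, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class CrawlListEntry:
    """One row in the URL list, with the bookkeeping fields described above."""
    url: str
    last_crawled: Optional[datetime] = None  # when we last fetched it
    crawl_time_ms: Optional[int] = None      # time to crawl / page speed
    robots_allowed: bool = True              # robots.txt verdict
    meta_index_allowed: bool = True          # meta robots "index" allowed
    rank: float = 0.0                        # stored rank value
    geo: Optional[str] = None                # geo-location identifier
    language: Optional[str] = None           # language identifier
```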
Your crawl servers do the clever work
The crawl list is made up from a multitude of sources, including: crawling and processing files and pages found on a server, links found inside a file (i.e. links from other pages), and URLs accessed via a browser, a toolbar app, or a submit page on your search app.
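For the "links found inside a page" source, a crawler could pull candidate URLs out of fetched HTML with something as simple as the sketch below (standard library only; a real crawler would also resolve relative links, respect robots.txt, and so on):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags so they can be fed into the crawl list."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

extractor = LinkExtractor()
extractor.feed('<a href="https://example.com/page">a page</a>')
print(extractor.links)   # ['https://example.com/page']
```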
Lastly, you would store a list matrix of the pages that link to this page and a value for those pages, which is in turn based on the number of links they have. You might also build in some kind of penalty/spam feature at this point.
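A very crude sketch of that idea, where a page's value comes from its linkers and each linker counts for more if it has inbound links of its own; this is a simplification I'm assuming for illustration, not a full PageRank-style calculation:

```python
# Assumed link matrix: page -> list of pages that link TO it.
INBOUND_LINKS = {
    "page-a": ["page-b", "page-c"],
    "page-b": ["page-c"],
    "page-c": [],
}

PENALISED = {"spam-site"}   # simple penalty/spam hook

def page_value(page):
    """Value of a page = a small score for each non-penalised linker,
    where a linker counts for more if it has inbound links of its own."""
    value = 0.0
    for linker in INBOUND_LINKS.get(page, []):
        if linker in PENALISED:
            continue
        value += 1.0 + 0.5 * len(INBOUND_LINKS.get(linker, []))
    return value

print(page_value("page-a"))   # 2.5: page-b contributes 1.5, page-c contributes 1.0
```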
You would then have a realtime source list - e.g. news sites - for which you allow a direct URL submit (e.g. look at Ping-o-Matic for WordPress).
Remember, submitted URLs save time - so if you have a trust indicator or protocol established, then trusted URLs would be auto-crawled much faster, because there is less time spent on discovery. Once you crawl a URL, you mark it as crawled; thus if the URL is found during another process, it doesn't need to be recrawled unless its refresh-by date has passed.
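The "don't recrawl until the refresh-by date has passed" check could be as small as this, assuming the crawl-list record carries last_crawled and refresh_by dates (refresh_by is an extra assumed field, not one listed above):

```python
from datetime import datetime

def needs_crawl(entry, now=None):
    """Decide whether a URL found during any process should be fetched.

    `entry` is assumed to carry `last_crawled` and `refresh_by` datetime fields.
    Already-crawled URLs are only recrawled once their refresh-by date has passed.
    """
    now = now or datetime.utcnow()
    if entry.last_crawled is None:
        return True                  # never crawled: fetch it
    return now >= entry.refresh_by   # crawled before: wait for the refresh date
```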
You would then have a scheduler, which would be a view or table of that data showing which pages and domains are prioritised first.
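The scheduler could then be little more than a sorted view over the crawl list; the ordering below (trusted sources first, then rank, then staleness) and the trusted and refresh_by fields are assumptions for illustration:

```python
from datetime import datetime

def crawl_schedule(entries, now=None):
    """A 'view' over the crawl list: URLs that are due, ordered so trusted
    and higher-rank pages come first, with the stalest pages first within that."""
    now = now or datetime.utcnow()
    due = [e for e in entries
           if e.last_crawled is None or now >= e.refresh_by]
    return sorted(
        due,
        key=lambda e: (
            0 if getattr(e, "trusted", False) else 1,   # trusted first
            -e.rank,                                     # then higher rank
            e.last_crawled or datetime.min,              # then stalest
        ),
    )
```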
You would need multiple crawlers if you were trying to index the web or a large piece of it. Some servers would work on a dedicated basis.

