How to Create your own SEARCH ENGINE

A DIY search engine that would be somewhat usable should work like this. I'm basing this on a model of Google I built a few years ago.


Physical
Firstly, you need some servers. The number will depend on how much content you plan to index
You'll need an internet connection
You should have a firewall
You should have a web application firewall to protect your data (e.g. KEMP's LoadMaster-integrated WAF services)
You'll need a load balancer for your application servers - both the web publisher and possibly your database servers (a simple health-check sketch follows this list)
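
Here's a rough Python sketch of that load-balancer behaviour - keep a pool of servers, health-check them, and only hand traffic to healthy ones. The server addresses and the /health endpoint are my own illustrative assumptions, not any particular product's API.

```python
# A rough sketch of the health-check behaviour described above: keep a pool
# of app servers, probe them, and only hand out healthy ones.
# The addresses and the /health endpoint are made-up assumptions.
import itertools
import urllib.request


class LoadBalancer:
    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = itertools.cycle(self.servers)

    def health_check(self, timeout=1.0):
        # Probe every server; drop non-responders, re-admit recovered ones.
        for server in self.servers:
            try:
                urllib.request.urlopen(server + "/health", timeout=timeout).close()
                self.healthy.add(server)
            except OSError:
                self.healthy.discard(server)  # removed from the rotation

    def next_server(self):
        # Round-robin over the pool, skipping anything marked unhealthy.
        for _ in range(len(self.servers)):
            server = next(self._rotation)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


lb = LoadBalancer(["http://app1:8080", "http://app2:8080"])
lb.health_check()            # run this on a timer in practice
# backend = lb.next_server()  # then route each query to a healthy backend
```

In production you'd use a real load balancer rather than rolling your own, but the remove-and-reconnect logic is the same as described further down.
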
Applications - the search web application
Note: most people think that search and ranking happen when you do a search. It's actually already done. To make search so fast, the search layer just returns the next 10 results from a list starting at position 1, and the only logic applied is (a) geo-specific pages are removed, (b) QDF checks pull in items based on a date and time range, and (c) real-time data is pulled from a queue which has it ready to go (see the Python sketch after this list)
Essentially you want to break the HTML user interface from the business logic and the databases (the index)
I would imagine a 3-tier setup where you have a very basic Apache (or IIS) web application that lets a user enter a search query and then fires it at the search application, which sends the results to a second user interface. That way, should there be peak demand, you can basically disconnect the user-operation servers from the functioning search server
The business logic layer - the "search" - should simply have 2 functions
Function A - strip the search string of non-value words (a, the, an, and, it, is), where each remaining word is searched, plus a second set of queries where it's searched in different %'s of completeness. E.g. a search for "how much does an iphone cost" would become [much or does or iphone or cost], and also [much or does or iphone] and [much or does or cost] (also sketched below)
The query is then handed to the next available search parser. The search parser connects to the next available database server. This is where load balancing comes in - if any single server becomes non-responsive, the load balancer just removes it from the queue and connects the next available healthy server
The parser requests the top 10 results for those words and is given a list of key IDs, each of which maps to a URL
The magic of rank order isn't established during search - search is the final leg
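
To make those two ideas concrete - Function A's stop-word stripping with %-of-completeness variants, and search-time work being little more than pagination over a precomputed rank order - here's a rough Python sketch. The stop-word set (I've added "how" so it matches the example above), the ranked ID list and the geo lookup are all assumptions for illustration.

```python
# Function A: strip non-value words, then emit the full OR-query plus
# leave-one-out variants at different %'s of completeness. The post shows
# two variants; this generator simply produces every leave-one-out set.
from itertools import combinations

STOP_WORDS = {"a", "the", "an", "and", "it", "is", "how"}  # "how" added to match the example


def query_variants(query):
    terms = [w for w in query.lower().split() if w not in STOP_WORDS]
    variants = [terms]
    if len(terms) > 2:
        variants += [list(c) for c in combinations(terms, len(terms) - 1)]
    return variants


print(query_variants("how much does an iphone cost"))
# [['much', 'does', 'iphone', 'cost'], ['much', 'does', 'iphone'], ...]


# Search time: no ranking happens here. Walk the already-ranked ID list,
# apply the light filters (geo shown; QDF and the real-time queue would be
# merged in the same way), and hand back the next 10.
def serve_page(ranked_ids, user_geo, geo_of, page=0, per_page=10):
    kept = []
    for doc_id in ranked_ids:
        page_geo = geo_of.get(doc_id)           # (a) drop other regions' pages
        if page_geo is not None and page_geo != user_geo:
            continue
        kept.append(doc_id)
        if len(kept) >= (page + 1) * per_page:  # stop once we have enough
            break
    return kept[page * per_page:(page + 1) * per_page]


print(serve_page([101, 102, 103, 104], "uk", {102: "de"}))  # -> [101, 103, 104]
```

Note the serving function never scores anything - the rank order was computed long before the query arrived, which is the whole point.
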



Applications - Crawl List

You need a URL list, which you would mark with last-crawled dates, time to crawl/page speed, and robots and meta index allowed status
You would also want to store a rank, a geo-location and a language identifier
Your crawl servers do the clever work
The crawl list is made up from a multitude of sources including: crawling and processing files and pages found on a server or inside a file (i.e. links from other pages), URLs accessed by a browser or toolbar app, and a submit page on your search app
Lastly, you would store a list matrix of the pages that link to this page and a value for those pages, which is in turn based on the number of links they have. You might also build in some kind of penalty/spam feature at this point
You would then have a realtime source list - e.g. news sites - that you allow to submit URLs directly (e.g. look at Ping-o-Matic for WordPress)
Remember, submitted URLs save time - so if you have a trust indicator or protocol established, then trusted URLs would be auto-crawled much faster, because there is less time spent on discovery. Once you crawl a URL, you mark it as crawled; thus if the URL is found during another process, it doesn't need to be recrawled unless its refresh-by date has passed.
You would then have a scheduler, which would be a view or table of that data in terms of which pages and domains are prioritised first (sketched after this list)
You would need multiple crawlers if you were trying to index the web or a large piece of it
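
Tying those points together, here's a rough Python sketch of what a crawl-list record, the refresh-by check and the scheduler view might look like. Every field name, the trusted-source shortcut and the link-value formula are assumptions for illustration, not a description of any real crawler.

```python
# Illustrative crawl-list record and scheduler. All names/fields are assumed.
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class CrawlEntry:
    url: str
    last_crawled: datetime | None = None   # last crawled date
    refresh_by: datetime | None = None     # recrawl only once this passes
    crawl_ms: int = 0                      # time to crawl / page speed
    index_allowed: bool = True             # robots + meta index allowed status
    rank: float = 0.0
    geo: str = ""
    language: str = ""
    inlinks: list[str] = field(default_factory=list)  # pages linking to this page
    trusted: bool = False                  # trusted submits skip discovery


def needs_crawl(entry, now):
    # A URL found during another process is skipped unless its
    # refresh-by date has passed (or it has never been crawled).
    if not entry.index_allowed:
        return False
    if entry.last_crawled is None:
        return True
    return entry.refresh_by is not None and now >= entry.refresh_by


def schedule(entries, now):
    # The "scheduler" view: due URLs only, trusted sources first,
    # then highest rank - the prioritisation described above.
    due = [e for e in entries if needs_crawl(e, now)]
    return sorted(due, key=lambda e: (not e.trusted, -e.rank))


def link_value(entry, rank_of):
    # Crude link-matrix value: each linking page contributes its own rank,
    # which is in turn based on the links *it* has. A penalty/spam
    # adjustment would hook in here.
    return sum(rank_of(url) for url in entry.inlinks)
```

In a real system the scheduler "view" would be a database query over the URL table rather than an in-memory sort, but the ordering logic is the same.
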

Some servers will work on a dedicated basis.