When we hear about that opaque area of the Internet known as the Deep Web, we inevitably think of its negative aspects: the exchange of information of any origin and nature, without the transparency we expect. In reality, the Deep Web simply refers to all the content that is not indexed by the search engines we all know.
It is estimated that around 95% of all Internet content is material other than plain text, which is the only kind that is easily indexed.
Thousands of different types of non-text data, such as video, audio or images, lack the information Google needs to locate them. In other cases, the requirement that users register before accessing the information blocks the search. This is a basic principle of privacy: the content of Gmail messages or documents stored in Dropbox, for example, is visible to the registered user but not to public search engines (although it is visible to the provider's own robots, which is why contextual advertising works). There is therefore an endless amount of opaque content with these characteristics:
- They require the user to enter access credentials.
- They rely on dynamic content generated with AJAX or JavaScript.
- They contain images or other information that is not indexable, as the sketch below illustrates.
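To make these limitations concrete, here is a minimal sketch in Python of what a text-only crawler actually sees when it fetches a page: the static HTML. Password fields signal a login wall, script tags mark content that is only built at runtime, and images without alt text carry nothing for an indexer to read. The URL and the heuristics are illustrative assumptions, not part of any real search engine.

```python
# Minimal sketch (not Memex code): inspect the static HTML a basic crawler sees
# and count the three kinds of "opaque" material described above.
import requests
from bs4 import BeautifulSoup

def audit_page(url: str) -> dict:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # 1. Login walls: a password field means the real content sits behind credentials.
    login_forms = soup.select("input[type=password]")

    # 2. Dynamic content: whatever <script> builds at runtime never appears
    #    in this static HTML, so a text-only crawler cannot index it.
    scripts = soup.find_all("script")

    # 3. Non-text media without descriptive metadata (no alt text to index).
    images_without_alt = [img for img in soup.find_all("img") if not img.get("alt")]

    return {
        "visible_text_chars": len(soup.get_text(strip=True)),
        "password_fields": len(login_forms),
        "script_tags": len(scripts),
        "images_without_alt": len(images_without_alt),
    }

if __name__ == "__main__":
    print(audit_page("https://example.com"))  # placeholder URL for the example
```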
For Chris Mattmann, a data science director at NASA, these characteristics alone are not enough to define the Deep Web: its content must also be hosted on web servers that use the anonymous network protocol called Tor. The protocol was created with good intentions by the US Department of Defense to protect sensitive information, and its code was released publicly in 2004. The problem came when organizations and individuals with less noble intentions began to exploit its possibilities for trafficking in drugs, weapons or people.
Searching for a door into the Deep Web
In 2014, the US Defense Advanced Research Projects Agency (DARPA) launched the Memex program to help law enforcement identify criminal operations online, within the Deep Web, through data mining. Using this tool to monitor the Deep Web on an ongoing basis could help identify human and arms trafficking shortly after photos are posted online, preventing crimes before they happen and saving lives.
[youtube]https://youtu.be/9QsjkJcUznA[/youtube]
Benefits of developing tools that limit web opacity
Paradoxically, research into technologies that push past the limits of today's search engines will benefit the development of the future search engines we will all use. The technologies developed in the program would provide mechanisms to improve content discovery, information extraction, information retrieval, user collaboration and other key search functions. Specifically, Memex is expected to achieve the following:
- Developing the next generation of search technologies to revolutionize the discovery, organization and presentation of domain-specific content.
- Creating a new domain-specific search paradigm to discover relevant content and organize it in ways that make it more immediately useful for specific tasks.
- Extending current search capabilities to the Deep Web and non-traditional content.
- Improving interfaces for the military, civil servants and commercial companies to find and organize publicly available information on the Internet.
For example, Internet search is still largely a manual process: it does not save sessions, it requires near-exact keywords entered one query at a time, and it does not organize or aggregate results beyond a list of links. New search engines based on Memex promise to fix this.
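As a toy illustration of what "organizing results beyond a list of links" could look like, the following Python snippet groups a flat list of hits by domain and ranks the clusters by size. The sample results and field names are invented for the example; Memex's actual tooling is far more sophisticated.

```python
# Toy illustration (not Memex code): turn a flat result list into clusters by domain.
from collections import defaultdict
from urllib.parse import urlparse

# Made-up results standing in for what a search engine would return.
results = [
    {"url": "https://example.org/post/1", "title": "First hit"},
    {"url": "https://example.org/post/2", "title": "Second hit"},
    {"url": "https://another.example.com/page", "title": "Third hit"},
]

grouped = defaultdict(list)
for result in results:
    domain = urlparse(result["url"]).netloc
    grouped[domain].append(result["title"])

# Present clusters instead of a flat list, largest first.
for domain, titles in sorted(grouped.items(), key=lambda item: -len(item[1])):
    print(f"{domain} ({len(titles)} results)")
    for title in titles:
        print(f"  - {title}")
```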
In addition, new search systems for complex information on the Internet, or within the Big Data of intranets, would make life easier for scientists of every speciality, who could track, index and correlate millions of graphic and other media files that today lack the data needed to locate them. Not to mention that all the code written for Memex, like Tor, is open source, and dozens of independent teams are already working to make the most of its possibilities.
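As a hedged sketch of that idea, the snippet below uses the Pillow library to read the EXIF metadata embedded in image files and turn each one into a small record that a search system could index and correlate. The photos folder and the output fields are assumptions made for the example; real extraction pipelines handle many more formats and signals.

```python
# Minimal sketch: make "invisible" image files indexable by extracting their
# embedded EXIF metadata into a simple record. Not part of the Memex codebase.
from pathlib import Path
from PIL import Image, ExifTags

def describe_image(path: Path) -> dict:
    with Image.open(path) as img:
        exif = img.getexif()
        # Map numeric EXIF tag ids to readable names (DateTime, Model, GPSInfo, ...).
        tags = {ExifTags.TAGS.get(tag_id, str(tag_id)): str(value)
                for tag_id, value in exif.items()}
        return {
            "file": path.name,
            "format": img.format,
            "size": img.size,
            "exif": tags,
        }

if __name__ == "__main__":
    for image_path in Path("photos").glob("*.jpg"):  # "photos" is a placeholder folder
        print(describe_image(image_path))
```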
[youtube]https://www.youtube.com/watch?v=vObvEGtPHKo[/youtube]