IBM releases data science and machine learning platform Cloud Private for Data

IBM is embracing artificial intelligence with the launch of IBM Cloud Private for Data. The platform consists of integrated data science, data engineering and app building services. According to IBM, it is designed to help organizations accelerate their AI journeys and increase productivity.


“Whether they are aware of it or not, every company is on a journey to AI as the ultimate driver of business transformation,” said Rob Thomas, general manager of IBM Analytics. “But for them to get there, they need to put in place an information architecture for collecting, managing and analyzing their data. With today’s announcements, we are planning to bring the AI destination closer and give access to powerful machine learning and data science technologies that can turn data into game-changing insight.”

The platform is powered by an in-memory database that is capable of ingesting and analyzing one million events per second, according to IBM’s internal testing. In addition, it is deployed on Kubernetes, allowing for a fully integrated development and data science environment, IBM explained. The company hopes it will provide organizations with access to data insights that were previously unobtainable, and allow users to exploit event-driven applications to gather and analyze data from IoT sensors, online commerce, mobile devices, and more.



Cloud Private for Data includes capabilities from IBM’s Data Science Experience, Information Analyzer, Information Governance Catalog, DataStage, Db2, and Db2 Warehouse. These capabilities will allow customers to gain insights from data stored in protected environments and make data-driven decisions. According to the company, the solution is meant to provide a data infrastructure layer for AI behind firewalls.

Going forward, IBM plans to have Cloud Private for Data run on all clouds and be available in industry-specific solutions for areas such as financial services, healthcare, and manufacturing.

As part of the launch, the company also announced the Data Science Elite Team, a no-charge consultancy team that will advise clients on machine learning adoption and assist them with their AI roadmaps.

Visual Studio Code will now ship with Anaconda

Microsoft has announced that Visual Studio Code will ship as part of the popular Python data science platform Anaconda. Microsoft first announced plans to bring Python to Azure Machine Learning, Visual Studio and SQL Server in September of last year.

According to Microsoft, “Visual Studio Code can easily be installed at the same time as Anaconda, providing a great editing and debugging experience for Python users, with special features tailor-made for Anaconda users.”

Microsoft has previously made investments in the Python community. It has already released a Python extension for VS Code and provides support for Python in Azure Machine Learning, SQL Server, and Azure Notebooks. According to Microsoft, the Microsoft Python Extension for Visual Studio Code is the most downloaded extension in the VS Code marketplace.

In addition, Microsoft created a team to support its Python extension, and will be extending that support for Anaconda environments as well.

According to the Anaconda team, VS Code is a good IDE choice for its users on Windows, macOS and Linux because of its debugging, code completion, and Git integration features. It also offers a number of extensions that developers can tailor to their specific needs.

“Anaconda, Inc. is excited to be able to make installation of Microsoft Visual Studio Code and the Python Extension for Visual Studio Code a more seamless experience for our Anaconda users,” Crystal Soja, product manager for the Anaconda Distribution and Anaconda Cloud, wrote in a post.

With great technology come great risks. As new technology continues to emerge in this digital day and age, Carnegie Mellon University’s Software Engineering Institute (SEI) is taking a deeper look at the impact it will have. The institute has released its 2017 Emerging Technology Domains Risk report detailing future threats and vulnerabilities.

“To support the [Department of Homeland Security’s United States Computer Emergency Readiness Team] US-CERT mission of proactivity, the CERT Coordination Center located at Carnegie Mellon University’s Software Engineering Institute was tasked with studying emerging systemic vulnerabilities, defined as exposures or weaknesses in a system that arise due to complex or unexpected interactions between subcomponents. The CERT/CC researched the emerging technology trends through 2025 to assess the technology domains that will become successful and transformative, as well as the potential cybersecurity impact of each domain,” according to SEI’s report.

According to the report, the top technologies that pose a risk are:

  • Blockchain: Blockchain technology has become more popular over the past couple of years as companies are working to take the technology out of cryptocurrency and transform it into a business model. Gartner recently named blockchain as one of the top 10 technology trends for 2018. However, the report notes the technology comes with unique security challenges. “Since it is a tool for securing data, any programming bugs or security vulnerabilities in the blockchain technology itself would undermine its usability,” according to the report.
  • Intelligent transportation systems: It seems every day a new company is joining the autonomous vehicle race. The benefits of autonomous vehicles include safer roads and less traffic, but the report states that one malfunction could have unintended consequences such as traffic accidents, property damage, injury and even death.
  • Internet of Things mesh networks: With the emergence of the IoT, mesh networks have been established as a way for “things” to connect and pass data. The report notes that mesh networks carry the same risks as traditional wireless networking devices and access points, such as spoofing, man-in-the-middle attacks and reconnaissance. In addition, mesh networks pose more risks due to device designs and implementations. “A single compromised device may become a staging point for attacks on every other node in the mesh as well as on home or business networks that act as Internet gateways,” the report states.
  • Machine learning: Machine learning provides the ability to add automation to big data and derive business insights faster; however, the SEI worries about the security impact of vulnerabilities when sensitive information is involved. In addition, just as it is easy to train machine learning algorithms on a body of data, it can be just as easy to trick them. “The ability of an adversary to introduce malicious or specially crafted data for use by a machine learning algorithm may lead to inaccurate conclusions or incorrect behavior,” according to the report.
  • Robotic surgery: Robot-assisted surgery involves a surgeon, computer console and a robotic arm that typically performs autonomous procedures. While the technique has been well established, and the impact of security vulnerabilities has been low, the SEI still has its concerns. “Where surgical robots are networked, attacks, even inadvertent ones, on these machines may lead to unavailability, which can have downstream effects on patient scheduling and the availability of hospital staff,” according to the report.
  • Smart buildings: Smart buildings fall under the realm of the Internet of Things, using sensors and data analytics to make buildings “efficient, comfortable, and safe.” Some examples of smart building features include real-time lighting adjustments, HVAC, and maintenance parameters. According to the SEI, the risks vary with the type of action. “The highest risks will involve safety- and security-related technologies, such as fire suppression, alarms, cameras, and access control. Security compromises in other systems may lead to business disruption or nothing more than mild discomfort. There are privacy implications both for businesses and individuals,” the researchers wrote.
  • Smart robots: Smart robots are being used alongside or in place of human workers. With machine learning and artificial intelligence capabilities, these robots can learn, adapt and make decisions based on their environments. Their risks include, but are not limited to, hardware, operating system, software and interconnectivity. “It is not difficult to imagine the financial, operational, and safety impact of shutting down or modifying the behavior of manufacturing robots, delivery drones; service-oriented or military humanoid robots; industrial controllers; or, as previously discussed, robotic surgeons,” according to the researchers.
  • Virtual personal assistants: Almost everyone has access to a virtual personal assistant, either on their PC or mobile device. These virtual personal assistants use artificial intelligence and machine learning to understand a user and mimic the skills of a human assistant. Since these assistants are highly reliant on data, the report states there is a privacy concern when it comes to security. “VPAs will potentially access users’ social network accounts, messaging and phone apps, bank accounts, and even homes. In business settings, they may have access to knowledge bases and a great deal of corporate data,” the researchers wrote.


According to the report, the top three domains that are the highest priority for outreach and analysis in 2017 are: intelligent transportation systems, machine learning and smart robots. “These three domains are being actively deployed and have the potential to have widespread impacts on society,” the report states.

Why Machine Learning Isn’t As Hard To Learn As You Think

Why is Machine Learning difficult to understand? originally appeared on Quora, the place to gain and share knowledge, empowering people to learn from others and better understand the world.

Answer by John L. Miller, Industry ML experience with video, sensor data, images. PhD. Microsoft, Google, on Quora:

I’m usually the first person to say something is hard, but I’m not going to here. Learning how to use machine learning isn’t any harder than learning any other set of libraries for a programmer.

The key is to focus on using it, not designing the algorithm. Look at it this way: if you need to sort data, you don’t invent a sort algorithm, you pick an appropriate algorithm and use it right.

It’s the same thing with machine learning. You don’t need to learn how the guts of the machine learning algorithm work. You need to learn what the main choices are (e.g. neural nets, random decision forests…), how to feed them data, and how to use the data produced.

There is a bit of an art to it: deciding when you can and can’t use machine learning, and figuring out the right data to feed into it. For example, if you want to know whether a movie shows someone running, you might want to send both individual frames, and sets of frame deltas a certain number of seconds apart.
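As a rough illustration of the running example, the sketch below pairs each frame with its difference from an earlier frame, so a model could be fed both the raw frame and the motion signal. The frame representation (nested lists of grayscale values), the function names, and the `gap` parameter (measured in frames rather than seconds) are all assumptions made for illustration, not something from the original answer.

```python
from typing import List, Tuple

Frame = List[List[int]]  # hypothetical grayscale frame: rows of pixel intensities


def frame_delta(earlier: Frame, later: Frame) -> Frame:
    """Per-pixel absolute difference between two equally sized frames."""
    return [
        [abs(b - a) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(earlier, later)
    ]


def build_inputs(frames: List[Frame], gap: int) -> List[Tuple[Frame, Frame]]:
    """Pair each frame with its delta against the frame `gap` steps earlier."""
    return [
        (frames[i], frame_delta(frames[i - gap], frames[i]))
        for i in range(gap, len(frames))
    ]


# Three tiny 2x2 frames in which a bright pixel moves diagonally.
frames = [
    [[0, 0], [0, 0]],
    [[0, 9], [0, 0]],
    [[0, 0], [0, 9]],
]
inputs = build_inputs(frames, gap=1)
```

Each element of `inputs` is a `(frame, delta)` pair; which of the two, or both, helps a given model is exactly the kind of judgment call the answer describes.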

If you’re a programmer and it’s incredibly hard to learn ML, you’re probably trying to learn the wrong things about it.


Microsoft and Amazon announce deep learning library Gluon

Microsoft has announced a new partnership with Amazon to create an open-source deep learning library called Gluon. The idea behind Gluon is to make artificial intelligence more accessible and valuable.

According to Microsoft, the library simplifies the process of making deep learning models and will enable developers to run multiple deep learning libraries. This announcement follows Microsoft’s introduction of the Open Neural Network Exchange (ONNX) format, another effort aimed at an open AI ecosystem.

Gluon supports symbolic and imperative programming, which is something not supported by many other toolkits, Microsoft explained. It also will support hybridization of code, allowing compute graphs to be cached and reused in future iterations. It offers a layers library that reuses pre-built building blocks to define model architecture. Gluon natively supports loops and ragged tensors, allowing for high execution efficiency for RNN and LSTM models, as well as supporting sparse data and operations. It also provides the ability to do advanced scheduling on multiple GPUs.

“This is another step in fostering an open AI ecosystem to accelerate innovation and democratization of AI, making it more accessible and valuable to all,” Microsoft wrote in a blog post. “With Gluon, developers will be able to deliver new and exciting AI innovations faster by using a higher-level programming model and the tools and platforms they are most comfortable with.”

The library will be available for Apache MXNet or Microsoft Cognitive Toolkit. It is already available on GitHub for Apache MXNet, with Microsoft Cognitive Toolkit support on the way.

Gartner’s top 10 technology trends for 2018

With only a couple more months left of the year, Gartner is already looking ahead to the future. The organization announced its annual top strategic technology trends at the Gartner Symposium/ITxpo this week.

Gartner selects its trends based on whether they have the potential to disrupt the industry and break out into something more impactful.

The top 10 strategic technology trends, according to Gartner, are:

    1. AI foundation: Last year, the organization included artificial intelligence and machine learning as its own trend on the list, but with AI and machine learning becoming more advanced, Gartner is looking at how the technology will be integrated over the next five years. “AI techniques are evolving rapidly and organizations will need to invest significantly in skills, processes and tools to successfully exploit these techniques and build AI-enhanced systems,” said David Cearley, vice president and Gartner Fellow. “Investment areas can include data preparation, integration, algorithm and training methodology selection, and model creation. Multiple constituencies including data scientists, developers and business process owners will need to work together.”
    2. Intelligent apps and analytics: Continuing with its AI and machine learning theme, Gartner predicts new intelligent solutions that change the way people interact with systems, and transform the way they work.
    3. Intelligent things: Last in the AI technology trend area is intelligent things. According to Gartner, these go beyond rigid programming models and exploit AI to provide more advanced behaviors and interactions between people and their environment. Such solutions include: autonomous vehicles, robots and drones as well as the extension of existing Internet of Things solutions.
    4. Digital twin: A digital twin is a digital representation of real-world entities or systems, Gartner explains. “Over time, digital representations of virtually every aspect of our world will be connected dynamically with their real-world counterpart and with one another and infused with AI-based capabilities to enable advanced simulation, operation and analysis,” said Cearley. “City planners, digital marketers, healthcare professionals and industrial planners will all benefit from this long-term shift to the integrated digital twin world.”
    5. Cloud to the edge: The Internet of Things has brought up the notion of edge computing. According to Gartner, edge computing is a form of computing topology that processes, collects and delivers information closer to its source. “When used as complementary concepts, cloud can be the style of computing used to create a service-oriented model and a centralized control and coordination structure with edge being used as a delivery style allowing for disconnected or distributed process execution of aspects of the cloud service,” said Cearley.
    6. Conversational platforms: Conversational platforms such as chatbots are transforming how humans interact with the emerging digital world. This new platform will take the form of question and command experiences, where a user asks a question and the platform is then able to respond.
    7. Immersive experience: In addition to conversational platforms, experiences such as virtual, augmented and mixed reality will also change how humans interact and perceive the world. Outside of video games and videos, businesses can use immersive experience to create real-life scenarios and apply it to design, training and visualization processes, according to Gartner.
    8. Blockchain: Once again, blockchain makes the list for its evolution into a digital transformation platform. In addition to the financial services industry, Gartner sees blockchains being used in a number of different areas such as government, healthcare, manufacturing, media distribution, identity verification, title registry, and supply chain.
    9. Event driven: New to this year’s list is the idea that the business is always looking for new digital business opportunities. “A key distinction of a digital business is that it’s event-centric, which means it’s always sensing, always ready and always learning,” said Yefim Natis, vice president, distinguished analyst and Gartner Fellow. “That’s why application leaders guiding a digital transformation initiative must make ‘event thinking’ the technical, organizational and cultural foundation of their strategy.”
    10. Continuous adaptive risk and trust: Lastly, the organization sees digital business initiatives adopting a continuous adaptive risk and trust assessment (CARTA) model as security becomes more important in a digital world. CARTA enables businesses to provide real-time, risk and trust-based decision making, according to Gartner.

“Gartner’s top 10 strategic technology trends for 2018 tie into the Intelligent Digital Mesh. The intelligent digital mesh is a foundation for future digital business and ecosystems,” said Cearley. “IT leaders must factor these technology trends into their innovation strategies or risk losing ground to those that do.”

To compare, last year’s trends are available here.

In addition, the organization also announced top predictions for IT organizations and users over the next couple of years. The predictions include: early adopters of visual and voice search will see an increase in digital commerce revenue by 30% by 2021; five of the top seven digital giants (Alibaba, Amazon, Apple, Baidu, Facebook, Google, Microsoft and Tencent) will willfully self-disrupt by 2020; and IoT technology will be in 95% of electronics by 2020.

The modern digital enterprise collects data on an unprecedented scale. Andrew Ng, currently at a startup, formerly chief scientist at Chinese internet giant Baidu and co-founder of education startup Coursera, says that, like electricity 100 years ago, “AI will change pretty much every major industry.” Machine Learning (ML) is a popular application of AI that refers to the use of algorithms that iteratively learn from data. ML, at its best, allows companies to find hidden insights in data without explicitly programming where to look.

Applications built based on ML are proliferating quickly. The list of well-known uses is long and growing every day. Apple’s Siri, Amazon’s recommendation engine, and IBM’s Watson are just a few prominent examples. All of these applications sift through incredible amounts of data and provide insights mapped to users’ needs.

Why is ML exploding in popularity? It is because the foundational technology in ML is openly available and accessible to organizations without specialized skill sets. Open source provides key technologies that make ML easy to learn, integrate and deploy into existing applications. This has lowered the barrier to entry and quickly opened ML to a much larger audience.

In the past two years, there has been an explosion of projects and development tools. The vast majority of consequential ones are open source. TensorFlow, just one key example, is a powerful system for building and training neural networks to detect and decipher patterns and correlations, similar to human learning and reasoning. It was open-sourced by Google at the end of 2015.

Main Languages for ML – Open Source Dominates

Open source programming languages are extremely popular in ML due to widespread adoption, supportive communities, and advantages for quick prototyping and testing.

For application languages, Python has a clear lead with interfaces and robust tools for almost all ML packages. Python has the added benefit of practically ubiquitous popularity. It is easy to integrate with applications and provides a wide ecosystem of libraries for web development, microservices, games, UI, and more.

Beyond Python, other open-source languages used in ML include R, Octave, and Go, with more coming along. Some of these, like R and Octave, are statistical languages that have a lot of the tools for working with data analysis and working within a sandbox. Go, developed and backed by Google, is new and is an excellent server and systems language with a growing library of data science tools. Its advantages include compiled code and speed. Its adoption rates are increasing dramatically.

Python Tools and Libraries for ML – An Introduction

The amazing strength of open source is in the proliferation of powerful tools and libraries that get you up and running quickly. At the core of the Python numerical/scientific computing ecosystem are NumPy and SciPy. NumPy and SciPy are foundational libraries on top of which many other ML and data science packages are built. NumPy provides support for numerical programming in Python. NumPy has been in development since 2006 and just received US$645,000 in funding this summer.

SciKit-Learn, with 20k stars and 10.7k forks, provides simple and efficient tools for data mining and data analysis. It is accessible to everybody, and reusable in various contexts. Built on NumPy, SciPy, and matplotlib, SciKit-Learn is very actively maintained and supports a wide variety of the most common algorithms including Classification, Regression, Clustering, Dimensionality Reduction, Model Selection, and Preprocessing. This is open source that is immediately ready for commercial implementation.

Keras is a Python Deep Learning library that allows for easy and fast prototyping and does not need significant ML expertise. It has been developed with a focus on enabling fast experimentation and being able to go from idea to result with the least possible delay. Keras can use TensorFlow, Microsoft Cognitive Toolkit (CNTK) or Theano as its backend, and you can swap between the three. Keras has 17.7k stars and 6.3k forks. Keras supports both convolutional networks and recurrent networks, as well as combinations of the two, and runs seamlessly on CPU and GPU.

TensorFlow is Google’s library for ML, which expresses calculations as a computation graph. With 64k stars and 31k forks, it is possibly one of the most popular projects on all of GitHub and is becoming the standard intermediate format for many ML projects. Python is the language recommended by Google, though there are other language bindings.
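To make the phrase “computation graph” concrete, here is a toy sketch in plain Python: the calculation is first described as a graph of nodes and only evaluated later, once input values are supplied. This is not TensorFlow’s API; the `Node` class and the `placeholder`, `add`, `mul`, and `evaluate` helpers are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Node:
    """One vertex of the graph: an operation plus the nodes it depends on."""
    op: str
    inputs: Tuple["Node", ...] = ()
    name: str = ""


def placeholder(name: str) -> Node:
    # An input whose value is supplied only at evaluation time.
    return Node("placeholder", (), name)


def add(a: Node, b: Node) -> Node:
    return Node("add", (a, b))


def mul(a: Node, b: Node) -> Node:
    return Node("mul", (a, b))


def evaluate(node: Node, feed: Dict[str, float]) -> float:
    """Walk the graph recursively and compute the value of `node`."""
    if node.op == "placeholder":
        return feed[node.name]
    values = [evaluate(dep, feed) for dep in node.inputs]
    if node.op == "add":
        return values[0] + values[1]
    if node.op == "mul":
        return values[0] * values[1]
    raise ValueError(f"unknown op: {node.op}")


# Building the graph performs no arithmetic; it only records x * y + x.
x = placeholder("x")
y = placeholder("y")
graph = add(mul(x, y), x)

result = evaluate(graph, {"x": 2.0, "y": 3.0})
```

Separating the description of a calculation from its execution is what lets a real library analyze, optimize, or distribute the graph before running it.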

These three superstar foundational ML tools are all open source and represent just a taste of the many important applications available to companies building ML strategies.

The Importance of ML Open Source Communities

Open source is built by communities that connect developers, users and enthusiasts in a common endeavor. Developers get useful examples and a feeling that others are extending the same topics. Communities provide examples, support and motivation that proprietary tools often lack. This also lowers the barrier to entry. Plus, many active ML communities are backed by large players like Google, Microsoft, Apple, Amazon, Apache and more.

Ask a software engineer: “How would you add search functionality to your product?” or “How do I build a search engine?” You’ll probably immediately hear back something like: “Oh, we’d just launch an ElasticSearch cluster. Search is easy these days.”

But is it? Numerous current products still have suboptimal search experiences. Any true search expert will tell you that few engineers have a very deep understanding of how search engines work, knowledge that’s often needed to improve search quality.
Even though many open source software packages exist, and the research is vast, the knowledge around building solid search experiences is limited to a select few. Ironically, searching online for search-related expertise doesn’t yield any recent, thoughtful overviews.
Emoji Legend
❗ “Serious” gotcha: consequences of ignorance can be deadly
🔷 Especially notable idea or piece of technology
☁️ ️Cloud/SaaS
🍺 Open source / free software
🦏 JavaScript
🐍 Python
☕ Java
🇨 C/C++
Why read this?
Think of this post as a collection of insights and resources that could help you to build search experiences. It can’t be a complete reference, of course, but hopefully we can improve it based on feedback (please comment or reach out!).
I’ll point at some of the most popular approaches, algorithms, techniques, and tools, based on my work on general purpose and niche search experiences of varying sizes at Google, Airbnb and several startups.

❗️Not appreciating or understanding the scope and complexity of search problems can lead to bad user experiences, wasted engineering effort, and product failure.

If you’re impatient or already know a lot of this, you might find it useful to jump ahead to the tools and services sections.
Some philosophy
This is a long read. But most of what we cover has four underlying principles:
🔷 Search is an inherently messy problem:
Queries are highly variable. The search problems are highly variable based on product needs.
Think about how different Facebook search (searching a graph of people) is from YouTube search (searching individual videos), or how different both of those are from Kayak (air travel planning is a really hairy problem), Google Maps (making sense of geospatial data), and Pinterest (pictures of a brunch you might cook one day).
Quality, metrics, and processes matter a lot:
There is no magic bullet (like PageRank) nor a magic ranking formula that makes for a good approach. A good approach is an always-evolving collection of techniques and processes that solve aspects of the problem and improve the overall experience, usually gradually and continuously.
❗️In other words, search is not just about building software that does ranking or retrieval (which we will discuss below) for a specific domain. Search systems are usually an evolving pipeline of components that are tuned and evolve over time and that build up to a cohesive experience.
In particular, the key to success in search is building processes for evaluation and tuning into the product and development cycles. A search system architect should think about processes and metrics, not just technologies.
Use existing technologies first:
As in most engineering problems, don’t reinvent the wheel yourself. When possible, use existing services or open source tools. If an existing SaaS (such as Algolia or managed Elasticsearch) fits your constraints and you can afford to pay for it, use it. This solution will likely be the best choice for your product at first, even if down the road you need to customize, enhance, or replace it.
❗️Even if you buy, know the details:
Even if you are using an existing open source or commercial solution, you should have some sense of the complexity of the search problem and where there are likely to be pitfalls.
Theory: the search problem
Search is different for every product, and choices depend on many technical details of the requirements. It helps to identify the key parameters of your search problem:
Size: How big is the corpus (a complete set of documents that need to be searched)? Is it thousands or billions of documents?
Media: Are you searching through text, images, graphical relationships, or geospatial data?
🔷 Corpus control and quality: Are the sources for the documents under your control, or coming from a (potentially adversarial) third party? Are all the documents ready to be indexed, or do they need to be cleaned up and selected?
Indexing speed: Do you need real-time indexing, or is building indices in batch fine?
Query language: Are the queries structured, or do you need to support unstructured ones?
Query structure: Are your queries textual, images, sounds? Street addresses, record ids, people’s faces?
Context-dependence: Do the results depend on who the user is, what their history with the product is, their geographical location, the time of day, etc.?
Suggest support: Do you need to support incomplete queries?
Latency: What are the serving latency requirements? 100 milliseconds or 100 seconds?
Access control: Is it entirely public or should users only see a restricted subset of the documents?
Compliance: Are there compliance or organizational limitations?
Internationalization: Do you need to support documents with multilingual character sets or Unicode? (Hint: Always use UTF-8 unless you really know what you’re doing.) Do you need to support a multilingual corpus? Multilingual queries?
Thinking through these points up front can help you make significant choices designing and building individual search system components.

A production indexing pipeline.
Theory: the search pipeline
Now let’s go through a list of search sub-problems. These are usually solved by separate subsystems that form a pipeline. What that means is that a given subsystem consumes the output of previous subsystems, and produces input for the following subsystems.
This leads to an important property of the ecosystem: once you change how an upstream subsystem works, you need to evaluate the effect of the change and possibly change the behavior downstream.
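The dependency between upstream and downstream stages can be seen even in a toy pipeline. The sketch below is only an illustration under assumed names (`run_pipeline`, `drop_empty`, `lowercase`, `dedupe` are hypothetical, and real subsystems are far more involved): each stage consumes the previous stage’s output, so moving or changing one stage changes what the later stages see.

```python
from typing import Callable, List

Stage = Callable[[List[str]], List[str]]


def run_pipeline(docs: List[str], stages: List[Stage]) -> List[str]:
    """Feed the output of each stage into the next one, in order."""
    for stage in stages:
        docs = stage(docs)
    return docs


def drop_empty(docs: List[str]) -> List[str]:
    # Remove documents that contain only whitespace.
    return [d for d in docs if d.strip()]


def lowercase(docs: List[str]) -> List[str]:
    return [d.lower() for d in docs]


def dedupe(docs: List[str]) -> List[str]:
    # Remove exact duplicates while preserving first-seen order.
    return list(dict.fromkeys(docs))


raw = ["Cat", "cat", " ", "Dog"]

# Normalizing before deduplication lets "Cat" and "cat" collapse into one.
normalized_first = run_pipeline(raw, [drop_empty, lowercase, dedupe])

# Same stages, different order: the duplicate survives.
dedupe_first = run_pipeline(raw, [drop_empty, dedupe, lowercase])
```

Here `normalized_first` holds two documents while `dedupe_first` holds three, which is why a change upstream needs to be evaluated against everything downstream of it.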

Here are the most important problems you need to solve:
Index selection:
Given a set of documents (e.g. the entirety of the Internet, all the Twitter posts, all the pictures on Instagram), select a potentially smaller subset of documents that may be worthy of consideration as search results and only include those in the index, discarding the rest. This is done to keep your indexes compact, and is almost orthogonal to selecting the documents to show to the user. Examples of particular classes of documents that don’t make the cut may include:
Spam:
Oh, all the different shapes and sizes of search spam! A giant topic in itself, worthy of a separate guide. A good web spam taxonomy overview.
Undesirable documents:
domain constraints might require filtering: porn, illegal content, etc. The techniques are similar to spam filtering, probably with extra heuristics.
Duplicates:
Or near-duplicates and redundant documents. These can be detected with locality-sensitive hashing, similarity measures, clustering techniques or even clickthrough data. A good overview of techniques.
Low-utility documents:
The definition of utility depends highly on the problem domain, so it’s hard to recommend the approaches here. Some ideas are: it might be possible to build a utility function for your documents; heuristics might work, for example an image that contains only black pixels is not a useful document; utility might be learned from user behavior.
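Of the document classes above, near-duplicates are the easiest to sketch in a few lines. The example below uses one simple similarity measure, Jaccard similarity over word shingles, rather than locality-sensitive hashing; the function names, the shingle size `k=3`, and the `0.5` threshold are illustrative assumptions that would need tuning on a real corpus.

```python
from typing import Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word windows ("shingles") in the text."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words become a single shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(a: str, b: str, threshold: float = 0.5, k: int = 3) -> bool:
    return jaccard(shingles(a, k), shingles(b, k)) >= threshold


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the sleepy dog"
doc_c = "completely different text about search engines"

similar = is_near_duplicate(doc_a, doc_b)
unrelated = is_near_duplicate(doc_a, doc_c)
```

Comparing every pair of documents this way is quadratic in corpus size, which is the cost that techniques such as locality-sensitive hashing are meant to avoid.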
Index construction:
For most search systems, document retrieval is performed using an inverted index — often just called the index.
The index is a mapping of search terms to documents. A search term could be a word, an image feature or any other document derivative useful for query-to-document matching. The list of the documents for a given term is called a posting list. It can be sorted by some metric, like document quality.
Figure out whether you need to index the data in real time.❗️Many companies with large corpora of documents use a batch-oriented indexing approach, but then find this is unsuited to a product where users expect results to be current.
With text documents, term extraction usually involves using NLP techniques, such as stop lists, stemming and entity extraction; for images or videos computer vision methods are used etc.
In addition, documents are mined for statistical and meta information, such as references to other documents (used in the famous PageRank ranking signal), topics, counts of term occurrences, document size, entities mentioned etc. That information can be later used in ranking signal construction or document clustering. Some larger systems might contain several indexes, e.g. for documents of different types.
Index formats. The actual structure and layout of the index is a complex topic, since it can be optimized in many ways. For instance, there are posting list compression methods, one could target an mmap()able data representation or use an LSM-tree for a continuously updated index.
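To make the term-to-posting-list mapping concrete, here is a minimal in-memory sketch in Python. The stop list, the tokenizer and the `quality` scores are assumptions made for the example; a production index would use proper NLP term extraction and a compact on-disk format.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "of", "and", "is"}  # illustrative stop list


def extract_terms(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (word.strip(".,!?;:\"'()") for word in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_index(docs, quality):
    """Map each term to a posting list of doc ids, highest quality first."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in extract_terms(text):
            postings[term].add(doc_id)
    return {
        term: sorted(ids, key=lambda d: (-quality[d], d))
        for term, ids in postings.items()
    }


def retrieve(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = extract_terms(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)
```

Sorting each posting list by quality means a retrieval step that only reads a prefix of the list still sees the best documents first.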
Query analysis and document retrieval:
Most popular search systems allow non-structured queries. That means the system has to extract structure out of the query itself. In the case of an inverted index, you need to extract search terms using NLP techniques.
The extracted terms can be used to retrieve relevant documents. Unfortunately, most queries are not very well formulated, so it pays to do additional query expansion and rewriting, like:
Term re-weighting.
Spell checking. Historical query logs are very useful as a dictionary.
Synonym matching. Another survey.
Named entity recognition. A good approach is to use HMM-based language modeling.
Query classification. Detect queries of particular type. For example, Google Search detects queries that contain a geographical entity, a porny query, or a query about something in the news. The retrieval algorithm can then make a decision about which corpora or indexes to look at.
Expansion through personalization or local context. Useful for queries like “gas stations around me”.
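As one example of query rewriting, the sketch below uses historical query logs as a spelling dictionary, as suggested in the list above. It is a deliberately simple Python approach that assumes lowercase ASCII terms and only considers candidates one edit away; the function names are made up for this example.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)


def build_dictionary(query_log):
    """Count how often each term appears in historical queries."""
    return Counter(term for query in query_log for term in query.lower().split())


def correct(term, term_counts):
    """Return the most frequent known term within one edit of term."""
    if term in term_counts:
        return term
    candidates = [c for c in edits1(term) if c in term_counts]
    if not candidates:
        return term  # nothing better known, leave the term alone
    return max(candidates, key=lambda c: (term_counts[c], c))


def rewrite_query(query, term_counts):
    """Spell-correct every term of a query independently."""
    return " ".join(correct(t, term_counts) for t in query.lower().split())
```

Correcting each term independently ignores context, so a real system would also score candidate rewrites as whole queries.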
Given a list of documents (retrieved in the previous step), their signals, and a processed query, create an optimal ordering (ranking) for those documents.
Originally, most ranking models in use were hand-tuned weighted combinations of all the document signals. Signal sets might include PageRank, clickthrough data, topicality information and others.
To further complicate things, many of those signals, such as PageRank, or ones generated by statistical language models contain parameters that greatly affect the performance of a signal. Those have to be hand-tuned too.
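A hand-tuned ranking model of this kind can be as small as a weighted sum. The Python sketch below assumes every signal is already normalized to the range 0 to 1; the signal names and weights are hypothetical and exist only to show the shape of the approach.

```python
# Hypothetical signal names and hand-tuned weights, for illustration only.
WEIGHTS = {"pagerank": 0.5, "clickthrough": 0.3, "topicality": 0.2}


def score(signals, weights=WEIGHTS):
    """Weighted sum of a document's signals; missing signals count as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in weights.items())


def rank(candidates, weights=WEIGHTS):
    """Order (doc_id, signals) pairs by descending score."""
    return sorted(candidates, key=lambda item: score(item[1], weights), reverse=True)
```

Every change to a weight reorders results across all queries at once, which is part of why tuning such models by hand gets painful as the signal set grows.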
Lately, 🔷 learning to rank, signal-based discriminative supervised approaches are becoming more and more popular. Some popular examples of LtR are McRank and LambdaRank from Microsoft, and MatrixNet from Yandex.
A new, vector space based approach for semantic retrieval and ranking is gaining popularity lately. The idea is to learn individual low-dimensional vector document representations, then build a model which maps queries into the same vector space.
Then, retrieval is just finding several documents that are closest by some metric (e.g. Euclidean distance) to the query vector. Ranking is the distance itself. If the mapping of both the documents and queries is built well, the documents are chosen not by the presence of some simple pattern (like a word), but by how close the documents are to the query in meaning.
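Once documents and queries live in the same vector space, the retrieval step itself is short. The Python sketch below does a brute-force nearest-neighbour search by Euclidean distance. It assumes the vectors have already been learned elsewhere, and the function names are made up for this example; large collections would need an approximate nearest-neighbour index instead of a full scan.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, distance) pairs closest to the query vector."""
    scored = [(doc_id, euclidean(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]
```

The returned distance doubles as the ranking score: the smaller it is, the higher the document ranks.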
Indexing pipeline operation
Usually, each of the above pieces of the pipeline must be operated on a regular basis to keep the search index and search experience current.
❗️Operating a search pipeline can be complex and involve a lot of moving pieces. Not only is the data moving through the pipeline, but the code for each module and the formats and assumptions embedded in the data will change over time.
A pipeline can be run in “batch” mode on a regular or occasional basis (if indexing speed does not need to be real time), in a streamed way (if real-time indexing is needed), or based on certain triggers.
Some complex search engines (like Google) have several layers of pipelines operating on different time scales: for example, a page that changes often is indexed with a higher frequency than a static page that hasn’t changed in years.
Serving systems
Ultimately, the goal of a search system is to accept queries, and use the index to return appropriately ranked results. While this subject can be incredibly complex and technical, we mention a few of the key aspects to this part of the system.
Performance: users notice when the system they interact with is laggy. ❗️Google has done extensive research, and they have noticed that the number of searches falls by 0.6% when serving is slowed by 300 ms. They recommend serving results in under 200 ms for most of your queries. A good article on the topic. This is the hard part: the system needs to collect documents from, possibly, many computers, then merge them into a possibly very long list and then sort that list in the ranking order. To complicate things further, ranking might be query-dependent, so, while sorting, the system is not just comparing two numbers, but performing computation.
🔷 Caching results: is often necessary to achieve decent performance. ❗️ But caches are just one large gotcha. They might show stale results when indices are updated or some results are blacklisted. Purging caches is a can of worms in itself: a search system might not have the capacity to serve the entire query stream with an empty (cold) cache, so the cache needs to be pre-warmed before the queries start arriving. Overall, caches complicate a system’s performance profile. Choosing a cache size and a replacement algorithm is also a challenge.
Availability: is often defined by an uptime/(uptime + downtime) metric. When the index is distributed, in order to serve any search results, the system often needs to query all the shards for their share of results. ❗️That means that if one shard is unavailable, the entire search system is compromised. The more machines are involved in serving the index, the higher the probability of one of them becoming defunct and bringing the whole system down.
Managing multiple indices: Indices for large systems may be separated into shards (pieces) or divided by media type or indexing cadence (fresh versus long-term indices). Results can then be merged.
Merging results of different kinds: e.g. Google showing results from Maps, News etc.
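Two of the points above about sharded serving can be sketched in a few lines of Python. The sketch assumes each shard returns its hits as (score, doc_id) pairs already sorted best-first, and that shard failures are independent of each other; both function names are made up for this example.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, k):
    """Merge per-shard hit lists, each sorted by descending score, into the top k."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))


def all_shards_up_probability(shard_availability, num_shards):
    """Chance that every shard is up at once, assuming independent failures."""
    return shard_availability ** num_shards
```

With 100 shards that are each up 99.9% of the time, `all_shards_up_probability(0.999, 100)` comes out at roughly 0.905, which illustrates the point about more machines meaning more ways for the whole system to fail.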

A human rater. Yeah, you should still have those.
Quality, evaluation, and improvement
So you’ve launched your indexing pipeline and search servers, and it’s all running nicely. Unfortunately the road to a solid search experience only begins with running infrastructure.
Next, you’ll need to build a set of processes around continuous search quality evaluation and improvement. In fact, this is actually most of the work and the hardest problem you’ll have to solve.

🔷 What is quality? First, you’ll need to determine (and get your boss or the product lead to agree), what quality means in your case:
Self-reported user satisfaction (includes UX)
Perceived relevance of the returned results (not including UX)
Satisfaction relative to competitors
Satisfaction relative to the performance of the previous version of the search engine (e.g. last week)
User engagement
Metrics: Some of these concepts can be quite hard to quantify. On the other hand, it’s incredibly useful to be able to express how well a search engine is performing in a single number, a quality metric.
By continuously computing such a metric for your (and your competitors’) system, you can both track your progress and explain how well you are doing to your boss. Here are some classical ways to quantify quality that can help you construct your magic quality metric formula:
Precision and recall measure how well the retrieved set of documents corresponds to the set you expected to see.
F score (specifically F1 score) is a single number that represents both precision and recall well.
Mean Average Precision (MAP) lets you quantify the relevance of the top returned results.
🔷 Normalized Discounted Cumulative Gain (nDCG) is like MAP, but weights the relevance of the result by its position.
Long and short clicks let you quantify how useful the results are to real users.
A good detailed overview.
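The classical metrics above are straightforward to compute once you have relevance judgments. Here is a minimal Python sketch using the standard textbook definitions. It assumes binary relevance for precision, recall, F1 and MAP, graded relevance gains for nDCG, and that the ranked list passed to `ndcg` contains every judged document for the query.

```python
import math


def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def average_precision(ranked, relevant):
    """Sum of precision@k at each relevant hit, divided by the number of relevant docs."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP: the mean of average precision over (ranked, relevant) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains, k=None):
    """DCG of the first k gains, normalized by the DCG of the ideal ordering."""
    gains = list(gains)
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0
```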
🔷 Human evaluations: Quality metrics might seem like statistical calculations, but they can’t all be done by automated calculations. Ultimately, metrics need to represent subjective human evaluation, and this is where a “human in the loop” comes into play.
❗️Skipping human evaluation is probably the most common cause of sub-par search experiences.

Usually, at early stages the developers themselves evaluate the results manually. At a later point human raters (or assessors) may get involved. Raters typically use custom tools to look at returned search results and provide feedback on the quality of the results.
Subsequently, you can use the feedback signals to guide development, help make launch decisions or even feed them back into the index selection, retrieval or ranking systems.

Here is a list of some other types of human-driven evaluation that can be done on a search system:
Basic user evaluation: The user ranks their satisfaction with the whole experience
Comparative evaluation: Compare with other search results (compare with search results from earlier versions of the system or competitors)
Retrieval evaluation: The query analysis and retrieval quality is often evaluated using manually constructed query-document sets. A user is shown a query and the list of the retrieved documents. She can then mark all the documents that are relevant to the query, and the ones that are not. The resulting pairs of (query, [relevant docs]) are called a “golden set”. Golden sets are remarkably useful. For one, an engineer can set up automatic retrieval regression tests using those sets. The selection signal from golden sets can also be fed back as ground truth to term re-weighting and other query re-writing models.
Ranking evaluation: Raters are presented with a query and two documents side-by-side. The rater must choose the document that fits the query better. This creates a partial ordering on the documents for a given query. That ordering can later be compared to the output of the ranking system. The usual ranking quality measures used are MAP and nDCG.
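Both golden sets and side-by-side ratings can be turned into automated checks. The Python sketch below shows one possible shape: a recall check over a golden set, and the fraction of rater preferences that a ranking respects. The data layout and function names are assumptions for this example, and pairwise agreement is used here as a simpler stand-in for MAP or nDCG.

```python
def golden_set_recall(golden_set, retrieve):
    """Average recall of a retrieval function over (query, relevant_docs) pairs."""
    recalls = []
    for query, relevant in golden_set:
        if not relevant:
            continue  # nothing to find for this query
        retrieved = set(retrieve(query))
        recalls.append(len(retrieved & set(relevant)) / len(relevant))
    return sum(recalls) / len(recalls) if recalls else 0.0


def pairwise_agreement(preferences, ranking):
    """Fraction of rater preferences (better, worse) that the ranking respects."""
    position = {doc: i for i, doc in enumerate(ranking)}
    judged = [(better, worse) for better, worse in preferences
              if better in position and worse in position]
    if not judged:
        return 0.0
    agreed = sum(1 for better, worse in judged if position[better] < position[worse])
    return agreed / len(judged)
```

A regression test could then assert that `golden_set_recall` does not drop below the value measured for the previous release.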
Evaluation datasets:
One should start thinking about the datasets used for evaluation (like “golden sets” mentioned above) early in the search experience design process. How do you collect and update them? How do you push them to the production eval pipeline? Is there a built-in bias?

Live experiments:
After your search engine catches on and gains enough users, you might want to start conducting live search experiments on a portion of your traffic. The basic idea is to turn some optimization on for a group of people, and then compare the outcome with that of a “control” group — a similar sample of your users that did not have the experiment feature on for them. How you would measure the outcome is, once again, very product specific: it could be clicks on results, clicks on ads etc.
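One common way to compare the experiment and control groups is a two-proportion z-test. The Python sketch below assumes the outcome is binary per user (for example, whether the user clicked a result at all) and that users are assigned to groups at random; the function name is made up for this example, and other outcome measures would need a different test.

```python
import math


def two_proportion_z_test(clicks_control, users_control, clicks_treatment, users_treatment):
    """Return (z, two-sided p-value) for the difference in click rates."""
    rate_control = clicks_control / users_control
    rate_treatment = clicks_treatment / users_treatment
    pooled = (clicks_control + clicks_treatment) / (users_control + users_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_control + 1 / users_treatment))
    if std_err == 0:
        return 0.0, 1.0  # everyone or no one clicked: no detectable difference
    z = (rate_treatment - rate_control) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return z, p_value
```

A positive z means the experiment group clicked more often; a small p-value suggests the difference is unlikely to be noise, although the threshold for acting on it is a product decision.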

Evaluation cycle time: How fast you improve your search quality is directly related to how fast you can complete the above cycle of measurement and improvement. It is essential from the beginning to ask yourself, “how fast can we measure and improve our performance?”
Will it take days, hours, minutes or seconds to make changes and see if they improve quality? ❗️Running evaluation should also be as easy as possible for the engineers and should not take too much hands-on time.
🔷 So… How do I PRACTICALLY build it?
This blogpost is not meant as a tutorial, but here is a brief outline of how I’d approach building a search experience right now:
1. As was said above, if you can afford it, just buy an existing SaaS (some good ones are listed below). An existing service fits if:
Your experience is a “connected” one (your service or app has an internet connection).
Does it support all the functionality you need out of the box? This post gives a pretty good idea of what functions you would want. To name a few, I’d at least consider: support for the media you are searching; real-time indexing support; query flexibility, including context-dependent queries.
Given the size of the corpus and the expected QpS, can you afford to pay for it for the next 12 months?
Can the service support your expected traffic within the required latency limits? If you are querying the service from an app, make sure that the given service is accessible quickly enough from where your users are.
2. If a hosted solution does not fit your needs or resources, you probably want to use one of the open source libraries or tools. In the case of connected apps or websites, I’d choose ElasticSearch right now. For embedded experiences, there are multiple tools below.
3. You most likely want to do index selection and clean up your documents (say, extract relevant text from HTML pages) before uploading them to the search index. This will decrease the index size and make getting to good results easier. If your corpus fits on a single machine, just write a script (or several) to do that. If not, I’d use Spark.
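For the single-machine case, the cleanup script in step 3 can start out very small. Here is a minimal Python sketch that pulls visible text out of an HTML page using only the standard library. The class and function names are made up for this example, and real pages usually need more work (navigation menus, boilerplate, encoding detection).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script and style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The extracted string is what you would upload to the index instead of the raw HTML.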

You can never have too many tools.
☁️ SaaS
☁️ 🔷Algolia — a proprietary SaaS that indexes a client’s website and provides an API to search the website’s pages. They also have an API to submit your own documents, support context dependent searches and serve results really fast. If I were building a web search experience right now and could afford it, I’d probably use Algolia first — and buy myself time to build a comparable search experience.
Various ElasticSearch providers: AWS (☁️ ElasticSearch Cloud), ☁️ and from ☁️ Qbox.
☁️ Azure Search — a SaaS solution from Microsoft. Accessible through a REST API, it can scale to billions of documents. Has a Lucene query interface to simplify migrations from Lucene-based solutions.
☁️ Swiftype — an enterprise SaaS that indexes your company’s internal services, like Salesforce, G Suite, Dropbox and the intranet site.
Tools and libraries
🍺☕🔷 Lucene is the most popular IR library. Implements query analysis, index retrieval and ranking. Any of the components can be replaced by an alternative implementation. There is also a C port: 🍺Lucy.
🍺☕🔷 Solr is a complete search server, based on Lucene. It’s a part of the Hadoop ecosystem of tools.
🍺☕🔷 Hadoop is the most widely used open source MapReduce system, originally designed as an indexing pipeline framework for the Nutch web crawler. It has been gradually losing ground to 🍺Spark as the batch data processing framework used for indexing. ☁️EMR is a proprietary implementation of MapReduce on AWS.
🍺☕🔷 ElasticSearch is also based on Lucene (feature comparison with Solr). It has been getting more attention lately, so much that a lot of people think of ES when they hear “search”, and for good reasons: it’s well supported, has an extensive API, integrates with Hadoop and scales well. There are open source and Enterprise versions. ES is also available as a SaaS. It can scale to billions of documents, but scaling to that point can be very challenging, so a typical scenario would involve a corpus that is orders of magnitude smaller.
🍺🇨 Xapian — a C++-based IR library. Relatively compact, so good for embedding into desktop or mobile applications.
🍺🇨 Sphinx — a full-text search server. Has a SQL-like query language. Can also act as a storage engine for MySQL or be used as a library.
🍺☕ Nutch — a web crawler. Can be used in conjunction with Solr. It’s also the tool behind 🍺Common Crawl.
🍺🦏 Lunr — a compact embedded search library for web apps on the client-side.
🍺🦏 searchkit — a library of web UI components to use with ElasticSearch.
🍺🦏 Norch — a LevelDB-based search engine library for Node.js.
🍺🐍 Whoosh — a fast, full-featured search library implemented in pure Python.
OpenStreetMap has its own 🍺deck of search software.
A few fun or useful data sets to try building a search engine or evaluating search engine quality:
🍺🔷 Commoncrawl — regularly updated open web crawl data. There is a mirror on AWS, accessible for free within the service.
🍺🔷 Openstreetmap data dump is a very rich source of data for someone building a geospatial search engine.
🍺 Google Books N-grams can be very useful for building language models.
🍺 Wikipedia dumps are a classic source to build, among other things, an entity graph out of. There is a wide range of helper tools available.
IMDb dumps are a fun dataset to build a small toy search engine for.
Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto is a good, deep academic treatment of the subject. This is a good overview for someone completely new to the topic.
Information Retrieval by S. Büttcher, C. Clarke and G. Cormack is another academic textbook with a wide coverage and is more up-to-date. Covers learn-to-rank and does a pretty good job at discussing theory of search systems evaluation. Also is a good overview.
Learning to Rank by T-Y Liu is the best theoretical treatment of LtR. Pretty thin on practical aspects though. Someone considering building an LtR system should probably check this out.
Managing Gigabytes — published in 1999, is still a definitive reference for anyone embarking on building an efficient index of a significant size.
Text Retrieval and Search Engines — a MOOC from Coursera. A decent overview of basics.
Indexing the World Wide Web: The Journey So Far (PDF), an overview of web search from 2012, by Ankit Jain and Abhishek Das of Google.
Why Writing Your Own Search Engine is Hard, a classic article from 2004 by Anna Patterson.
A curated list of search-related resources.
A great blog on everything search by Daniel Tunkelang.
Some good slides on search engine evaluation.
This concludes my humble attempt to make a somewhat-useful “map” for an aspiring search engine engineer. Did I miss something important? I’m pretty sure I did — you know, the margin is too narrow to contain this enormous topic. Let me know if you think that something should be here and is not.
P.S. This post is part of an open, collaborative effort to build an online reference, the Open Guide to Practical AI, which we’ll release in draft form soon. See this popular guide for an example of what’s coming.

Facebook, IBM, Microsoft lead advances in AI

The MIT-IBM Watson AI Lab is focused on fundamental artificial intelligence (AI) research with the goal of propelling scientific breakthroughs that unlock the potential of AI.

Artificial intelligence and machine learning are playing larger roles in software, from data consumption and analysis to test automation and user experience. These cognitive services will drive the next wave of technology innovation. And industry heavyweights Facebook, IBM and Microsoft are leading the charge with new investments for innovation.

IBM yesterday announced plans to create an AI research partnership with the Massachusetts Institute of Technology to unlock AI’s potential by advancing hardware, software and algorithms around deep learning, the company said in the announcement.

IBM will make a 10-year, $240 million commitment to the MIT-IBM Watson AI Lab, which will be located in Cambridge, Mass., where IBM has a research lab and where MIT’s campus is located. Dario Gil, IBM Research VP of AI, and Dean Anantha P. Chandrakasan of MIT’s School of Engineering, will co-chair the new lab. The project will draw from the expertise of more than 100 AI scientists and MIT professors and students.

“The field of artificial intelligence has experienced incredible growth and progress over the past decade. Yet today’s AI systems, as remarkable as they are, will require new innovations to tackle increasingly difficult real-world problems to improve our work and lives,” said Dr. John Kelly III, IBM senior vice president, Cognitive Solutions and Research, in a statement. “The extremely broad and deep technical capabilities and talent at MIT and IBM are unmatched, and will lead the field of AI for at least the next decade.”

Among the efforts the lab team will pursue are creating AI algorithms that can tackle more complex problems, understanding the physics of AI, how AI applies to vertical industries, and delivering societal and economic benefits through AI.

Meanwhile, Microsoft yesterday announced the Open Neural Network Exchange in conjunction with Facebook. Microsoft’s Cognitive Toolkit, along with Caffe2 and PyTorch, will all support the open-source ONNX.


According to Microsoft’s announcement, the ONNX representation of neural networks will provide framework interoperability, allowing developers to use their preferred tools while moving between frameworks. ONNX also offers shared optimization, so organizations looking to improve the performance of their neural networks can do so to multiple frameworks at once by simply targeting the ONNX representation.

ONNX, the announcement explained, “provides a definition of an extensible computation graph model, as well as definitions of built-in operators and standard data types.” Initially, the project is focused on inferencing capabilities.

ONNX code and documentation are available on GitHub.

Digital operations management company PagerDuty is using machine learning and advanced response automation to help businesses orchestrate the correct response to any situation. Among the new capabilities in PagerDuty’s platform are the ability to group related alerts to provide context, the ability to recognize similar incidents with the context of who dealt with the similar issue in the past and what steps were taken to resolve it, the ability to design automated response patterns, and more.

“Today’s dynamic digital business climate has exponentially increased both opportunity for growth and downside risks to mitigate. The latest Digital Operations Management capabilities announced [yesterday] – machine learning and automation – tackle the real-time, all-the-time demands of consumers and business, translating complex events and signals into actionable insights, and orchestrating teams across businesses in service of revenue and productivity,” said Jennifer Tejada, CEO of PagerDuty.

Lastly, Cloudera yesterday announced the acquisition of Fast Forward Labs, an applied research and advisory services company specializing in machine learning and applied AI.

Now known as Cloudera Fast Forward Labs, the company is focused on practical research into data science, and applying that research to broad business problems.

Top 5 machine learning libraries for Java

Companies are scrambling to find enough programmers capable of coding for ML and deep learning. Are you ready? Here are five of our top picks for machine learning libraries for Java.

The long AI winter is over. Instead of being a punchline, machine learning is one of the hottest skills in tech right now. Companies are scrambling to find enough programmers capable of coding for ML and deep learning. While no one programming language has won the dominant position, here are five of our top picks for ML libraries for Java.


It comes as no surprise that Weka is our number one pick for the best Java machine learning library. Weka 3 is a fully Java-based workbench best used for machine learning algorithms. Weka is primarily used for data mining, data analysis, and predictive modelling. It’s completely free, portable, and easy to use with its graphical interface.

“Weka’s strength lies in classification, so applications that require automatic classification of data can benefit from it, but it also supports clustering, association rule mining, time series prediction, feature selection, and anomaly detection,” said Prof. Eibe Frank, an Associate Professor of Computer Science at the University of Waikato in New Zealand.

Weka’s collection of machine learning algorithms can be applied directly to a dataset or called from your own Java code. This supports several standard data mining tasks, including data preprocessing, classification, clustering, visualization, regression, and feature selection.

MOA is an open-source software used specifically for machine learning and data mining on data streams in real time. Developed in Java, it can also be easily used with Weka while scaling to more demanding problems. MOA’s collection of machine learning algorithms and tools for evaluation are useful for regression, classification, outlier detection, clustering, recommender systems, and concept drift detection. MOA can be useful for large evolving datasets and data streams as well as the data produced by the devices of the Internet of Things (IoT).

MOA is specifically designed for machine learning on data streams in real time. It aims for time- and memory-efficient processing. MOA provides a benchmark framework for running experiments in the data mining field by providing several useful features including an easily extendable framework for new algorithms, streams, and evaluation methods; storable settings for data streams (real and synthetic) for repeatable experiments; and a set of existing algorithms and measures from the literature for comparison.


Last year the JAXenter community nominated Deeplearning4j as one of the most innovative contributors to the Java ecosystem. Deeplearning4j is a commercial grade, open-source distributed deep-learning library in Java and Scala brought to us by the good people (and semi-sentient robots!) of Skymind. Its mission is to bring deep neural networks and deep reinforcement learning together for business environments.

Deeplearning4j is meant to serve as a DIY tool for Java, Scala and Clojure programmers working on Hadoop, the massive distributed data storage system with enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. The deep neural networks and deep reinforcement learning are capable of pattern recognition and goal-oriented machine learning. All of this means that Deeplearning4j is super useful for identifying patterns and sentiment in speech, sound and text. Plus, it can be used for detecting anomalies in time series data like financial transactions.


Developed primarily by Andrew McCallum and students from UMass and UPenn, MALLET is an open-source Java toolkit for machine learning on text. This Java-based package supports statistical natural language processing, clustering, document classification, information extraction, topic modelling, and other machine learning applications to text.

MALLET’s specialty includes sophisticated tools for document classification such as efficient routines for converting text. It supports a wide variety of algorithms (including Naïve Bayes, Decision Trees, and Maximum Entropy) and code for evaluating classifier performance. Also, MALLET includes tools for sequence tagging and topic modelling.


The Environment for Developing KDD-Applications Supported by Index Structures (ELKI for short) is an open-source data mining software for Java. ELKI’s focus is research in algorithms, emphasizing unsupervised methods in cluster analysis, database indexes, and outlier detection. ELKI allows an independent evaluation of data mining algorithms and data management tasks by separating the two. This feature is unique among data mining frameworks like Weka or RapidMiner. ELKI also allows arbitrary data types, file formats, or distance or similarity measures.

Designed for researchers and students, ELKI provides a large collection of highly configurable algorithm parameters. This allows fair and easy evaluation and benchmarking of algorithms. This means ELKI is particularly useful for data science; ELKI has been used to cluster sperm whale vocalizations and in work on spaceflight operations, bike sharing redistribution, and traffic prediction. Pretty useful for any grad students out there looking to make sense of their datasets!


Do you have a favorite machine learning library for Java that we didn’t mention? Tell us in the comments and explain why it’s a travesty we forgot about it!