Simple and advanced technical Big Data terms
This Big Data glossary will briefly introduce you to the most important terms. We promise it will be an easy, pleasant read!
It is not by any means exhaustive, but it makes a good, light prep read before a meeting with a Big Data director or vendor, or a quick refresher before a job interview. Also, if you are interested in similar terms related to Artificial Intelligence, we encourage you to visit a similar blog post on our partner company’s site, Sigmoidal.
At the very beginning, let us explain the most important term when speaking of Big Data: Big Data itself. It refers to datasets so large and complex that analyzing them requires high computing power, and doing so can yield essential information and new knowledge. Here you can read more about Big Data and its applications.
Now that you know what Big Data is, it is high time to jump into more advanced definitions. Below are the technical terms our engineers at DevsData consider the most essential.
So, let’s get started!
Artificial Intelligence is intelligence exhibited by machines. It lets them perform tasks normally reserved for humans, such as speech recognition, visual perception, decision making, or prediction.
Business Intelligence is the process of analyzing raw data in search of valuable information, for the purpose of understanding and improving a business. Using BI can help make fast and accurate business decisions.
Biometrics is technology for recognizing people by their physical traits, such as their face or fingerprints. It uses Artificial Intelligence algorithms.
Cloud computing is a term describing computing resources stored on and running on remote servers. The resources, including software and data, can be accessed from anywhere over the internet.
A Big Data Scientist is a person who can take structured and unstructured data points and use formidable skills in statistics, mathematics, and programming to organize them. They apply all their analytical powers, such as contextual understanding, industry knowledge, and awareness of existing assumptions, to uncover hidden solutions for business development.
Data visualization is the right solution when a quick look at a large amount of information is required. Graphs, charts, diagrams, and similar tools allow the user to find interesting patterns or trends in a dataset. It also helps when it comes to validating data: the human eye can notice unexpected values when they are presented graphically.
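As a quick, hedged illustration (the library choice and the sample data below are our own assumptions, not part of any particular stack), a few lines of Python with matplotlib are enough to make an outlier visible at a glance:

```python
import matplotlib.pyplot as plt

# Hypothetical measurements with one suspicious outlier at 21.
measurements = [2, 3, 3, 4, 4, 4, 5, 5, 6, 21]

plt.hist(measurements, bins=10)           # distribution at a glance
plt.xlabel("Measured value")
plt.ylabel("Frequency")
plt.title("Distribution of measurements")
plt.show()                                # the bar at 21 stands out immediately
```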
The Internet of Things, IoT for short, is the concept of connecting devices such as house lighting, heating, or even fridges to a common network. It allows large amounts of data to be collected and later used in real-time analytics. The term is also connected with the smart home: the concept of controlling a house with a phone, etc.
Machine Learning is the ability of computers to learn new skills without being explicitly programmed. In practice, this means algorithms that learn from the data they process and use what they have learned to make decisions. Machine learning is used to exploit the opportunities hidden in big data.
A search engine is a software system that searches the web under the conditions specified by the user in the search query. The most popular search engines are Google, Yahoo, and Bing. Big Data undoubtedly plays a major role in the development of search engines: these complex systems crawl the web and process the URLs related to the user’s query in order to return the most accurate results.
Neural networks are a series of algorithms that recognize relationships in datasets through a process resembling the workings of the human brain. An important property of such a system is that it can generate the best possible result without the output criteria having to be redesigned. Neural networks are very useful in finance; for instance, they can be used to forecast stock market prices.
Big data can also be described with these 5 V’s: Volume, Velocity, Variety, Veracity, and Value.
Some say there are now 8 V’s, with Visualization, Viscosity, and Virality being the new ones.
The algorithm is a simple term that is absolutely essential when speaking of Big Data. An algorithm is a mathematical formula or a set of instructions provided to a computer that describes how to process the given data in order to obtain the needed information.
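To make this concrete, here is a minimal sketch in Python (the function and the numbers are purely illustrative): an algorithm for computing an average, written as an explicit set of instructions.

```python
def mean(values):
    """Arithmetic mean: an explicit recipe the computer follows step by step."""
    total = 0.0
    for v in values:            # step 1: accumulate the sum
        total += v
    return total / len(values)  # step 2: divide by the number of items

print(mean([4, 8, 15, 16, 23, 42]))  # 18.0
```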
Concurrency is the ability to manage multiple tasks at the same time. It helps deal with the many processes performed by a machine. The best-known example of concurrency is multitasking.
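A minimal Python sketch of concurrency (the task names and the sleep-based “work” are illustrative assumptions): two threads whose waiting periods overlap instead of running back to back.

```python
import threading
import time

def fetch(name):
    time.sleep(1)                # simulate a slow I/O operation
    print(f"{name} finished")

threads = [threading.Thread(target=fetch, args=(n,)) for n in ("task-1", "task-2")]
for t in threads:
    t.start()                    # both tasks are now in progress at once
for t in threads:
    t.join()                     # total wall time is ~1s, not ~2s
```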
I’ll be going a little deeper into analysis in this article, as big data’s holy grail lies in analytics. Comparative analysis, as the name suggests, is about comparing multiple processes, datasets, or other objects using statistical techniques such as pattern analysis, filtering, and decision-tree analytics. I know it’s getting a little technical, but I can’t completely avoid the jargon. Comparative analysis can be used in healthcare to compare large volumes of medical records, documents, images, etc., for more effective and hopefully more accurate medical diagnoses.
A data warehouse is a system that stores data in order to analyze and process it in the future. The source of the data can vary depending on its purpose: data can be uploaded from the company’s CRM systems as well as imported from external files or databases.
A data lake is a repository that stores a huge amount of raw data in its original format. While the hierarchical data warehouse stores information in files and folders, a data lake uses a flat architecture to store data. Each item in the repository has a unique identifier and is marked with a set of metadata tags. When a business query appears, the repository can be searched for specific information, and then a smaller, separate set of data can be analyzed to help solve a specific problem.
Data mining is an analytical process designed to explore large data resources in search of regular patterns and systematic interrelationships between variables, and then to evaluate the results by applying the detected patterns to new subsets of the data. The final goal of data mining is usually to predict client behavior, sales volume, the likelihood of client loss, etc.
Data cleansing is the process of correcting or removing inaccurate data or records from a database. This step is extremely important: when collecting data from sensors, websites, or web scraping, some incorrect data may slip in. Without cleansing, the user would risk coming to wrong conclusions after analyzing the data.
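As a hedged sketch (the column name and the −999 sentinel value are invented for illustration), cleansing a small pandas table might look like this:

```python
import pandas as pd

# Hypothetical sensor readings with a missing value and an impossible one.
df = pd.DataFrame({"temperature_c": [21.4, None, 22.1, -999.0, 20.8]})

df = df.dropna()                    # remove records with missing values
df = df[df["temperature_c"] > -50]  # drop physically impossible readings
print(df)                           # only the three plausible rows remain
```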
Metadata is data that describes other data. It summarizes basic information about data, which can make finding and working with particular instances of data easier. For example, the author, the dates created and modified, and the file size are very basic document metadata. Besides document files, metadata is used for images, videos, spreadsheets, and web pages.
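For example, reading basic file-system metadata in Python takes just a few lines (run as a script; the fields shown are an arbitrary selection):

```python
import os
import datetime

info = os.stat(__file__)  # metadata about this script file itself
print("size in bytes:", info.st_size)
print("last modified:", datetime.datetime.fromtimestamp(info.st_mtime))
```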
The term NoSQL is an abbreviation of “not only SQL”. It describes databases or database management systems that deal with non-relational or non-structured data. Thanks to their flexibility, they are very commonly used for processing large amounts of data.
R and Python are two of the most commonly used open-source programming languages for Big Data. Python is considered slightly friendlier for beginners than R. It is also very flexible and efficient at processing large datasets. R, on the other hand, is more specialized, as it is predominantly used for statistics. It has a large community of users who voluntarily contribute to its development by adding new libraries and packages, for example ggplot2, used for data visualization.
A server is a computer that receives requests related to applications. Its task is to respond to those requests or forward them over a network. The term is commonly used in big data.
SQL is a language used to manage and query data in relational databases. There are many relational database systems, such as MySQL, PostgreSQL, or SQLite. Each of these systems has its own SQL dialect, which differs slightly from the others.
Queries are the questions used to communicate with a database. Iterating over a big dataset record by record would usually be very time-consuming. In such a case, a query can be created, i.e., the database can be asked to return all records where a given condition is satisfied.
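As a minimal sketch using Python’s built-in sqlite3 module (the table and the threshold are made up for illustration), a query pushes the filtering to the database instead of the application code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 19.99), (2, 250.00), (3, 74.50)])

# Ask the database for every order above a threshold, rather than
# iterating over all records in application code.
for row in conn.execute("SELECT id, amount FROM orders WHERE amount > 50"):
    print(row)  # (2, 250.0) and (3, 74.5)
```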
Cluster analysis is an exploratory analysis that tries to identify structures within the data; it is also called segmentation analysis or taxonomy analysis. More specifically, it tries to identify homogeneous groups of cases, i.e., observations, participants, or respondents. Cluster analysis is used to identify groups of cases when the grouping is not known in advance. Because it is exploratory, it does not make any distinction between dependent and independent variables. The different cluster analysis methods that SPSS offers can handle binary, nominal, ordinal, and scale (interval or ratio) data.
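A minimal sketch with scikit-learn’s KMeans (the 2-D points are invented; SPSS, mentioned above, offers analogous methods) shows how groups are discovered without any labels being given:

```python
import numpy as np
from sklearn.cluster import KMeans

# Six observations with no known grouping.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [1 1 1 0 0 0] -- two discovered groups
```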
Fuzzy logic is an approach to logic in which, instead of judging whether a statement is true or false (values 0 or 1), we measure the degree to which the statement is true (values from 0 to 1). This approach is commonly used in Artificial Intelligence.
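A toy membership function in Python (the temperature thresholds are arbitrary assumptions) makes the 0-to-1 idea tangible:

```python
def warmth(temperature_c):
    """Degree (0..1) to which 'it is warm' is true."""
    if temperature_c <= 10:
        return 0.0
    if temperature_c >= 25:
        return 1.0
    return (temperature_c - 10) / 15  # linear ramp between 10 and 25 degrees

print(warmth(5))     # 0.0 -> clearly not warm
print(warmth(17.5))  # 0.5 -> somewhat warm
print(warmth(30))    # 1.0 -> clearly warm
```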
When people think of big data, they immediately think of Hadoop. Hadoop (with its cute elephant logo) is an open-source software framework that consists of what is called the Hadoop Distributed File System (HDFS) and allows for the storage, retrieval, and analysis of very large datasets using distributed hardware.
Sounds complicated?
If you really want to impress someone, talk about YARN (Yet Another Resource Negotiator), which, as the name says, is a resource scheduler. I am really impressed by the folks who come up with these names. The Apache Software Foundation, which came up with Hadoop, is also responsible for Pig, Hive, and Spark (yup, these are all names of various pieces of software). Aren’t you impressed with these names?
MapReduce is a technique used to process large datasets with a parallel, distributed algorithm on a cluster. MapReduce is responsible for two types of tasks: “Map” divides the data and processes it at the node level, while “Reduce” collects the answers from the Map step and combines them into the answer to the query.
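The idea fits in a few lines of plain Python (this is a single-machine imitation for intuition only, not a distributed implementation):

```python
from collections import defaultdict

documents = ["big data is big", "data lakes store raw data"]

# "Map": each document independently emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# "Reduce": pairs sharing a key are combined into a final count.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 3, 'is': 1, ...}
```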
MongoDB is one of the most popular NoSQL database systems. It stores data in documents written in a JSON-like format. Since it is written in a relatively low-level language (C++), it offers very high performance.
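A minimal sketch with the pymongo driver (the connection string, database, and collection names are placeholders; a MongoDB instance is assumed to be running locally):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["demo_db"]["users"]  # hypothetical database and collection

# Documents are stored as flexible, JSON-like structures.
users.insert_one({"name": "Ada", "skills": ["python", "statistics"]})
for doc in users.find({"skills": "python"}):
    print(doc)
```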
An object database stores data in the form of objects. The term “object” means the same as in object-oriented programming: an instance of a certain class.
Parallelism is a term similar to concurrency, but with a slight yet important difference. Parallelism not only lets a machine manage multiple tasks at a time but also perform multiple tasks at the same time, thanks to multicore processors: each core performs its own task.
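A minimal sketch with Python’s multiprocessing module (the workload is a toy function): here the work is genuinely executed on several cores at once, not merely interleaved.

```python
from multiprocessing import Pool

def square(n):
    return n * n  # CPU-bound work done in a separate process

if __name__ == "__main__":
    with Pool(processes=4) as pool:         # roughly one worker per core
        print(pool.map(square, range(10)))  # [0, 1, 4, ..., 81]
```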
Apache Spark is a fast, in-memory data processing engine for efficiently executing streaming, machine learning, or SQL workloads that require fast, iterative access to datasets. Spark is generally a lot faster than the MapReduce approach we discussed earlier.
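A hedged word-count sketch with PySpark (a local Spark installation is assumed; the input lines are invented) shows the chained, in-memory transformations:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

lines = spark.sparkContext.parallelize(["big data is big", "spark is fast"])
counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # "map" step
               .reduceByKey(lambda a, b: a + b))     # "reduce" step
print(counts.collect())

spark.stop()
```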
Out of the box, Python does not include implementations of machine learning algorithms or the data structures behind them. A developer needs to implement them from scratch or use ready-made libraries such as TensorFlow or Keras. TensorFlow is an open-source library for symbolic math calculations as well as machine learning. It has implementations for many languages, such as Python and JavaScript. Code written with it is fairly low-level, which can give reasonably high performance. Keras is also an open-source library used with Python; however, the code is more high-level, which makes the library friendlier for machine learning beginners than TensorFlow.
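A minimal Keras sketch (the layer sizes, the random data, and the training settings are all illustrative assumptions, not a recommended architecture):

```python
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(4,)),               # four input features
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # binary prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy")

X = np.random.rand(32, 4)        # fake feature matrix
y = np.random.randint(0, 2, 32)  # fake binary labels
model.fit(X, y, epochs=2, verbose=0)
```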
Ok, that was helpful – now what?
Since you are now acquainted with all the big data terms every manager should know, you can read how to write better code, check out examples of difficult JavaScript interview questions, or send us an email to discuss whether Big Data solutions could be applied to your business case (general@devsdata.com). Also, if you are interested in a real-life example of Big Data in real estate, you can check out Adradar, a search engine for property and real estate based on AI and Big Data methods.