- Apps that generate and use large amounts of data
- The volume of data, its complexity, and the speed at which it is used all grow quickly
Components of Data-intensive application
Main components of data-intensive application
- Database: Source of truth for any consumer. Eg: Oracle/MySQL
- Cache: Temporarily storing an expensive operation to speed up reading. Eg: Memcache
- Full-text index: For searching data by keyword or filter. Eg: Apache Lucene
- Message queue: For message passing between processes. Eg: Apache Kafka
- Stream Processing: For handling continuous streams of events in near real time. Eg: Apache Samza, Apache Spark
- Batch Processing: For crunching large amounts of data. Eg: Apache Spark/ Hadoop
- Application Code: Acts as the connective tissue between the above components.
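A minimal sketch of how two of these components can interact: the cache-aside read path, where application code checks a cache before falling back to the database as the source of truth. The dict-based cache and `DB` table here are stand-ins for a real Memcache and MySQL, and `fetch_user_from_db` is a hypothetical helper.

```python
# Cache-aside pattern: check the cache first, fall back to the
# database (source of truth) on a miss, then populate the cache.
# The dict cache and fake DB below stand in for Memcache and MySQL.

DB = {42: {"name": "Alice"}}   # pretend database table
cache = {}                     # pretend Memcache

def fetch_user_from_db(user_id):
    return DB.get(user_id)     # the "expensive" operation

def get_user(user_id):
    if user_id in cache:                 # cache hit: fast path
        return cache[user_id]
    user = fetch_user_from_db(user_id)   # cache miss: hit the DB
    if user is not None:
        cache[user_id] = user            # speed up the next read
    return user

print(get_user(42))  # miss -> reads from DB, fills the cache
print(get_user(42))  # hit  -> served from the cache
```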
Role of application developer
The role of the application developer is to design the data system for reliability, scalability & maintainability
Reliability
- Fault tolerance (human, software, hardware faults)
- No unauthorized access
- Chaos testing (deliberately inducing faults to verify tolerance)
- Tolerating machine failure (no single machine should take the system down)
- Bugs -> automated testing
- Staging/testing environments
- Quick roll-backs
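One common technique for tolerating transient software/hardware faults is retrying with exponential backoff; a minimal sketch, where `flaky_call` is a made-up operation that fails twice before succeeding:

```python
import time

def retry(fn, attempts=3, base_delay=0.01):
    """Retry fn on failure, with exponential backoff between attempts."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                              # out of retries
            time.sleep(base_delay * 2 ** attempt)  # 10ms, 20ms, 40ms...

# A hypothetical flaky operation: fails twice, then succeeds.
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient fault")
    return "ok"

print(retry(flaky_call))  # succeeds on the third attempt
```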
Scalability
- Higher traffic volume
- Load described by parameters such as peak number of reads/writes and simultaneous users (peaks can be smoothed by front-loading/back-loading processing)
- Capacity planning
- Response time (online system) vs throughput (analytics)
- End-user response time
- 90th, 95th percentile SLO/A (Service Level Objective/ Service Level Agreement)
- Scaling up (more powerful machine) [vertical scaling]
- Scale-out (distribute load among many smaller machines) [horizontal scaling]
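The p90/p95 figures above are just the sorted response times read off at the 90th/95th positions; a quick sketch using the nearest-rank method (the sample latencies are made up):

```python
import math

def percentile(samples, p):
    """Return the p-th percentile of samples (nearest-rank method)."""
    ranked = sorted(samples)
    k = math.ceil(p * len(ranked) / 100)  # 1-indexed rank
    return ranked[k - 1]

# Hypothetical response times in milliseconds for 10 requests.
latencies_ms = [12, 15, 14, 13, 200, 16, 12, 14, 15, 500]

print(percentile(latencies_ms, 50))  # -> 14  (median)
print(percentile(latencies_ms, 90))  # -> 200 (p90: the slow tail appears)
print(percentile(latencies_ms, 95))  # -> 500 (p95: worst outlier)
```

Note how the median looks healthy while the high percentiles expose the outliers — which is why SLOs/SLAs are usually stated in percentiles rather than averages.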
Maintainability
- Add new people to work
- Productivity
- Operable: Configurable & testable
- Simple: Easy to understand and ramp up on
- Evolvable: Easy to change (refactor regularly, design patterns, abstraction, reduce code debt)
Note: Remember a solution will not work for every system
Abstraction
Abstraction means exposing the right amount of information while hiding unnecessary detail.
Layers of Data Abstraction
Data Model
User -> friends -> Posts -> likes
Relational Model table diagram
Relational query
select from table (join)
Document based model
JSON documents for the models
Document based query
Graph Model
graph diagram of the above
graph query
Data model and data querying work hand in hand (you need to analyze both in order to decide what to use)
Data storage and data retrieval are the two main things to consider. The same query can be written in 2 lines or 10 lines depending on the data model
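A sketch of the User -> friends -> Posts -> likes data in all three shapes, as plain Python structures (names, IDs, and fields are made up for illustration):

```python
# Relational shape: normalized tables of rows; relationships via IDs,
# and reading related data back requires a join.
users = [(1, "Alice"), (2, "Bob")]        # (user_id, name)
posts = [(10, 1, "Hello world")]          # (post_id, author_id, text)
likes = [(2, 10)]                         # (user_id, post_id) join table

# "Query" for Alice's posts = join users and posts on author_id:
alice_posts = [p for p in posts
               for u in users if u[0] == p[1] and u[1] == "Alice"]

# Document shape: one self-contained document per user; related data
# is nested inside it (locality), so no join is needed to read it.
alice_doc = {
    "name": "Alice",
    "posts": [{"id": 10, "text": "Hello world", "liked_by": ["Bob"]}],
}

# Graph shape: vertices plus labeled edges; suits data where anything
# may relate to anything (friend-of-friend style traversals).
edges = {("Alice", "FRIEND", "Bob"),
         ("Alice", "WROTE", "post:10"),
         ("Bob", "LIKES", "post:10")}
bobs_likes = [dst for (src, rel, dst) in edges
              if src == "Bob" and rel == "LIKES"]
```

The same read (Alice's posts, Bob's likes) is a join in the relational shape, a direct field access in the document shape, and an edge traversal in the graph shape — which is the 2-lines-vs-10-lines point above.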
Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare
Graph databases go in the opposite direction, targeting use cases where everything is potentially related to everything
Types of Databases
There are many types of databases but we are only focusing on these 3
- Relational Databases
- Document based databases
- Graph databases
Relational Database
- Optimized for transactions & batch processing
- Data organized as tables/relations
- Object relation mapping required
- Oracle, MySQL, PostgreSQL, etc.
Document based databases
- NoSQL: Not Only SQL
- Flexible schema; better performance due to locality; high write throughput
- Mainly free & open source (Espresso, CouchDB, MongoDB…)
Graph databases
- Best suited for highly interconnected data with many-to-many relations
- Social graphs, web graphs (databases: Neo4j, AnzoGraph; query languages: SPARQL, Cypher)
Some points to remember
SQL: Schema enforcement happens at write time, by the database (schema-on-write)
NoSQL: Enforcement happens at read time, in application code (schema-on-read)
Hybrids of the relational and document models are also a practice gaining some traction
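A sketch of the read-level (schema-on-read) enforcement mentioned above: documents written at different times can have different shapes, and the application code, not the database, reconciles them when reading. The field names here are made up:

```python
# Two documents written under different "schemas" over time:
# old records stored a single `name`, newer ones split it in two.
docs = [
    {"name": "Ada Lovelace"},                      # old shape
    {"first_name": "Alan", "last_name": "Turing"}, # new shape
]

def full_name(doc):
    # Schema-on-read: the reader handles both shapes; the database
    # never rejected either write, unlike a schema-on-write ALTER TABLE.
    if "name" in doc:
        return doc["name"]
    return f'{doc["first_name"]} {doc["last_name"]}'

print([full_name(d) for d in docs])  # both shapes handled at read time
```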
Storage & Retrieval
Which database to use?
Every storage engine is optimized for different use cases. Select the right storage engine for your use case
As application developers, we need a rough idea of what the storage engine is doing under the hood so that we can select the right one.
Tuning & Optimizing pointers
Databases can be broadly categorized into OLTP & OLAP, each with different read patterns, write patterns, users, data sizes, etc.
- OLTP: Online Transaction Processing databases are optimized for latency. Eg: MySQL, Oracle
- OLAP: Online Analytical Processing databases are optimized for data crunching and data warehousing – star/snowflake schemas, column-oriented storage, column compression, data cubes, optimized for read queries (materialized views, at the cost of flexibility). Eg: HBase, Hive, Spark
OLTP databases are typically row-based
OLAP databases are typically column-based
Row store
+ Easy to add/modify a record
– Might read in unnecessary data
(data size is generally in MB or GB; suitable for end-user systems)
Column store
+ Only need to read in relevant data
– A tuple write requires multiple accesses
(suitable for read-mostly, read-intensive, large data repositories like in-house analytics systems)
(data size is generally in GB/PB) -> big data
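The row-vs-column trade-off can be sketched with plain Python lists: the same table stored row-wise (appending a record touches one place) and column-wise (a query reads only the column it needs, and sorted columns compress well). Table contents are made up:

```python
# The same 3-column table in two physical layouts.
rows = [                      # row store: one tuple per record
    ("2024-01-01", "US", 5),
    ("2024-01-01", "US", 7),
    ("2024-01-02", "EU", 3),
]

columns = {                   # column store: one list per column
    "date":     ["2024-01-01", "2024-01-01", "2024-01-02"],
    "region":   ["US", "US", "EU"],
    "quantity": [5, 7, 3],
}

# OLTP-style write: one append in the row store...
rows.append(("2024-01-02", "EU", 4))
# ...but one access per column list in the column store.
for col, val in zip(columns, ("2024-01-02", "EU", 4)):
    columns[col].append(val)

# OLAP-style read: summing one column scans only relevant data.
total = sum(columns["quantity"])  # never touches date/region

# Column compression: run-length encode repeated adjacent values.
def rle(values):
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

print(rle(columns["region"]))  # -> [('US', 2), ('EU', 2)]
```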