
Data Intensive Applications

  • Applications that generate and consume large amounts of data
  • The volume of data, its complexity, or the speed at which it changes grows quickly

Components of a data-intensive application

The main components of a data-intensive application are:

  • Database: Source of truth for any consumer. E.g. Oracle, MySQL
  • Cache: Temporarily stores the result of an expensive operation to speed up reads. E.g. Memcached
  • Full-text index: For searching data by keyword or filter. E.g. Apache Lucene
  • Message queue: For passing messages between processes. E.g. Apache Kafka
  • Stream processing: For processing events continuously as they arrive. E.g. Apache Samza, Apache Spark
  • Batch processing: For crunching large amounts of accumulated data. E.g. Apache Spark, Hadoop
  • Application code: Acts as the connective tissue between the above components.
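As a rough illustration of how application code ties a cache to a database, here is a minimal sketch of a read-through lookup. The names `database`, `cache`, `stats`, and `get_user` are hypothetical, and plain dictionaries stand in for a real database and a real cache server.

```python
# Plain dicts stand in for a real database and a real cache (assumption).
database = {42: {"id": 42, "name": "Ada"}}  # source of truth
cache = {}                                   # temporary copy of expensive reads
stats = {"db_reads": 0}                      # counts how often the slow path runs


def get_user(user_id):
    """Return a user record, consulting the cache before the database."""
    if user_id in cache:
        return cache[user_id]          # cache hit: no database access
    stats["db_reads"] += 1             # cache miss: go to the source of truth
    record = database[user_id]
    cache[user_id] = record            # remember the result for later reads
    return record


first = get_user(42)    # miss, reads the database
second = get_user(42)   # hit, served from the cache
print(stats["db_reads"])
```

The database stays the source of truth; the cache only holds a copy, so a real system also needs a rule for expiring or invalidating entries when the underlying data changes.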

Role of application developer

The role of the application developer is to design the data system for reliability, scalability, and maintainability.

Reliability

  • Fault tolerance (human, software, and hardware faults)
  • No unauthorized access
  • Chaos testing
  • Tolerates machine failures
  • Bugs -> automated testing
  • Staging/testing environment
  • Quick rollback

Scalability

  • Higher traffic volume
  • Traffic load with peaks in the number of reads, writes, and simultaneous users (can be handled by front-load/back-load processing)
  • Capacity planning
  • Response time (online systems) vs. throughput (analytics)
  • End-user response time
  • 90th and 95th percentile response times as SLO/SLA targets (Service Level Objective / Service Level Agreement)
  • Scaling up (a more powerful machine) [vertical scaling]
  • Scaling out (load distributed among many smaller machines) [horizontal scaling]
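To make the percentile bullet concrete, here is a small sketch that computes the 50th, 90th, and 95th percentile of a set of response times using the nearest-rank method. The sample values and the `percentile` helper are made up for illustration; monitoring systems may use other interpolation rules.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds, with two slow outliers.
response_times_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 900]

p50 = percentile(response_times_ms, 50)
p90 = percentile(response_times_ms, 90)
p95 = percentile(response_times_ms, 95)
print(p50, p90, p95)
```

With these numbers the median is 14 ms while the 90th percentile is 250 ms, which is why an SLO is usually stated on a high percentile rather than on the typical request.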

Maintainability

  • Easy to onboard new people
  • Productivity
  • Operable: Configurable & testable
  • Simple: Easy to understand and ramp up on
  • Evolvable: Easy to change (timely refactoring, design patterns, abstraction, reduced technical debt)

Note: Remember that no single solution works for every system.

Abstraction

Abstraction means providing the right amount of information, but not too much.

Layers of Abstraction

Layers of Data Abstraction

Data Model

User -> friends -> Posts -> likes

Relational Model table diagram

Relational query

select from table (join)
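A relational query over the user/posts/likes example might look like the sketch below. It uses Python's built-in `sqlite3` module with an in-memory database; the table names, column names, and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES users(id),
        body TEXT NOT NULL
    );
    CREATE TABLE likes (
        user_id INTEGER NOT NULL REFERENCES users(id),
        post_id INTEGER NOT NULL REFERENCES posts(id),
        PRIMARY KEY (user_id, post_id)
    );
    INSERT INTO users VALUES (1, 'Ada'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO posts VALUES (10, 1, 'hello'), (11, 1, 'second post'), (12, 2, 'hi');
    INSERT INTO likes VALUES (2, 10), (3, 10), (2, 11);
""")

# Join users -> posts -> likes to count the likes on each of Ada's posts.
rows = conn.execute("""
    SELECT p.body, COUNT(l.user_id) AS like_count
    FROM users u
    JOIN posts p ON p.author_id = u.id
    LEFT JOIN likes l ON l.post_id = p.id
    WHERE u.name = ?
    GROUP BY p.id
    ORDER BY p.id
""", ("Ada",)).fetchall()
print(rows)
```

Each entity lives in its own table, and the relationships are reassembled at query time through joins.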

Document based model

JSON documents for the models
Document based query
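In a document model, the same user could be stored as one self-contained JSON document. The sketch below is illustrative only: the field names are assumptions, and a plain Python dictionary stands in for a document database.

```python
import json

# One document holds the user together with their posts and likes.
user_doc = json.loads("""
{
  "id": 1,
  "name": "Ada",
  "friends": [2, 3],
  "posts": [
    {"id": 10, "body": "hello", "liked_by": [2, 3]},
    {"id": 11, "body": "second post", "liked_by": [2]}
  ]
}
""")

# No join needed: everything about the user is already in the document.
like_counts = [(post["body"], len(post["liked_by"])) for post in user_doc["posts"]]
print(like_counts)
```

Reading one user is a single lookup, but the `friends` and `liked_by` fields only hold ids, so following them to other users would need extra lookups.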

Graph Model

graph diagram of the above
graph query
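A graph model stores the same data as nodes and labeled edges, and a query becomes a traversal. The sketch below uses a plain list of edges and a breadth-first search; the edge labels, the sample data, and the helper names are hypothetical, and a real graph database would express this in a query language such as Cypher or SPARQL.

```python
from collections import deque

# Directed, labeled edges: (source, label, destination).
edges = [
    ("Ada", "FRIEND", "Bob"),
    ("Bob", "FRIEND", "Cy"),
    ("Cy", "FRIEND", "Dee"),
    ("Bob", "WROTE", "post:12"),
    ("Ada", "LIKES", "post:12"),
]


def neighbors(node, label):
    """Nodes reachable from `node` over one edge with the given label."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]


def within_hops(start, label, max_hops):
    """Breadth-first search: nodes reachable from `start` by following
    `label` edges at most `max_hops` times, in discovery order."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors(node, label):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, depth + 1))
    return found


friends_of_friends = within_hops("Ada", "FRIEND", 2)
print(friends_of_friends)
```

Changing the hop limit changes how far the query reaches without changing its shape, which is awkward to express with a fixed number of joins.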

The data model and the query language work hand in hand (you need to analyze both to decide what to use).

Data storage and data retrieval are the two main things to consider. The same query can take 2 lines or 10 lines depending on the data model.

Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare.

Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything.

Types of Databases

There are many types of databases, but we focus on only these three:

  • Relational Databases
  • Document based databases
  • Graph databases

Relational Database

  • Optimized for transactions & batch processing
  • Data organized as tables/relations
  • Object-relational mapping (ORM) is typically needed between application objects and tables
  • Oracle, MySQL, PostgreSQL, etc.

Document based databases

  • NoSQL: Not Only SQL
  • Flexible schema, better performance due to locality, high write throughput
  • Mainly free & open source (Espresso, CouchDB, MongoDB…)

Graph databases

  • Best suited for highly interconnected data with many-to-many relationships
  • Social graphs, web graphs (databases: Neo4j, AnzoGraph; query languages: SPARQL, Cypher)

Some points to remember

SQL: Schema enforcement is done by the database at write time (schema-on-write)

NoSQL: Schema enforcement happens at read time, in the application (schema-on-read)
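The difference between the two enforcement points can be sketched as follows. The relational half uses Python's built-in `sqlite3` module; the document half uses plain dictionaries as a stand-in for a document store. The table, the field names, and the `read_email` helper are assumptions for illustration.

```python
import sqlite3

# Enforcement by the database: the bad row is rejected when it is written.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
try:
    conn.execute("INSERT INTO users (id, email) VALUES (1, NULL)")
    rejected_on_write = False
except sqlite3.IntegrityError:
    rejected_on_write = True

# Enforcement at read time: documents of different shapes are all stored,
# and the reading code decides how to interpret each one.
documents = [
    {"id": 1, "email": "ada@example.com"},
    {"id": 2, "contact": {"email": "bob@example.com"}},  # a different shape
    {"id": 3},                                           # no email at all
]


def read_email(doc):
    """Return the email from whichever shape the document has, else None."""
    if "email" in doc:
        return doc["email"]
    return doc.get("contact", {}).get("email")


emails = [read_email(doc) for doc in documents]
print(rejected_on_write, emails)
```

The flexibility of the second approach comes at the price of every reader having to handle every shape that was ever written.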

Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare.

Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything.

Hybrid approaches that combine the relational, document, and graph models are also a practice gaining some traction.

Storage & Retrieval

Which database to use?

Every storage engine is optimized for different use cases. Select the right storage engine for your use case

As application developers, we need a rough idea of what the storage engine is doing under the hood so that we can select the right one.

Tuning & Optimizing pointers

Databases can be broadly categorized into OLTP & OLAP, each with different read patterns, write patterns, users, data sizes, etc.

  • OLTP: Online Transaction Processing databases are optimized for latency. E.g. MySQL, Oracle
  • OLAP: Online Analytical Processing databases are optimized for data crunching and data warehousing: star/snowflake schemas, column-oriented storage, column compression, data cubes and materialized views (optimized for read queries, at the cost of flexibility). E.g. HBase, Hive, Spark

OLTP databases are typically row-based

OLAP databases are typically column-based

Row store

+ Easy to add/modify a record

– Might read in unnecessary data

(data size is generally in MB or GB; suitable for end-user-facing systems)

Column store

+ Only need to read in relevant data

– Writing a tuple requires multiple accesses

(suitable for read-mostly, read-intensive, large data repositories such as in-house analytics systems)
(data size is generally in GB to PB) -> big data
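The trade-off between the two layouts can be sketched with in-memory Python structures. This is only an illustration under simplifying assumptions: the records are made up, and real storage engines work with disk pages and compression rather than lists.

```python
# Row layout: each record is stored together.
rows = [
    {"id": 1, "country": "IN", "amount": 120},
    {"id": 2, "country": "US", "amount": 80},
    {"id": 3, "country": "IN", "amount": 50},
]

# Column layout: one list per column, values aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Transaction-style access: fetching one whole record is a single lookup
# in the row layout.
record = rows[1]

# Analytics-style access: an aggregate over one column reads only that
# column in the column layout and ignores "id" and "country".
total_amount = sum(columns["amount"])

# Adding a record: one append in the row layout, but one append per
# column in the column layout.
new_row = {"id": 4, "country": "DE", "amount": 30}
rows.append(new_row)
column_writes = 0
for name, value in new_row.items():
    columns[name].append(value)
    column_writes += 1

print(total_amount, column_writes)
```

The aggregate touches a single list in the column layout, while inserting one record costs three separate writes, which mirrors the pluses and minuses listed above.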
