SOA stands for Service-Oriented Architecture.
It is the reliance on web services to send and receive data. The services are loosely coupled, meaning any client on any platform can connect to any service as long as the essential contracts are met.
Clients can consume multiple services, and services can have multiple clients.
The decomposition of a system into autonomous (or nearly autonomous) units of responsibility, and exposure of those units in a secure and controlled fashion.
Exposure of an API for your system.
The client has access to only a limited set of functionality.
Clients access functionality by making service calls.
Service-oriented architecture is a design approach where multiple services collaborate to provide some end set of capabilities. A service here typically means a completely separate operating system process. Communication between these services occurs via calls across a network rather than method calls within a process boundary.
(Summarized) The SOA movement got co-opted by vendors for profit and never explained how to achieve its architectural goals. The microservice movement (an extension of SOA) fills in the details of how to design, implement, and maintain such a system.
Comparison from the client side:
- OOP is when you use a library. You have access to all the individual objects. You need a lot of knowledge about how to use the objects together, what order to run operations in, etc.
- SOA abstracts all those details away, only exposing high-level operations that always leave the application in a valid state (i.e. they are stateless: no state has to be maintained between calls). See the sketch below.
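A minimal Java sketch of the client-side difference; all type names here are hypothetical, for illustration only:

```java
// Hypothetical types, for illustration only.
interface OrderParser   { Order parse(String rawText); }
interface TaxCalculator { double taxFor(Order order); }
interface OrderStore    { void save(Order order); }
record Order(String id, double total) {}

// OOP / library style: the client must know each object and the correct call order.
class LibraryStyleClient {
    void submit(String rawText, OrderParser parser, TaxCalculator taxes, OrderStore store) {
        Order parsed = parser.parse(rawText);                       // step 1: parse
        double tax = taxes.taxFor(parsed);                          // step 2: tax must be computed before saving
        store.save(new Order(parsed.id(), parsed.total() + tax));   // step 3: persist
        // Forgetting or reordering any step leaves the application in an invalid state.
    }
}

// SOA style: one coarse-grained, stateless operation hides all of the above
// and always leaves the system in a valid state.
interface OrderSubmissionService {
    void submitOrder(String rawOrderText);
}
```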
Comparison from the application side:
- OOP is concerned with the lower-level design decisions.
- SOA is concerned with the higher-level design decisions.
- Reusability: two applications can call on the same service on the backend
- Maintenance: a service can be altered or replaced without client applications being aware of it
An application (in this distinction) is a single executable running on a single machine. It is not the responsibility of the application to solve connectivity problems, such as losing access to a storage location.
A system includes multiple executables on (possibly) multiple machines. The system is responsible for solving its own connectivity problems.
Ex: a web page is a system because it involves the server, the client's machine and browser, and probably at least one database.
RPC stands for Remote Procedure Call, e.g., a request/response interaction. RPC calls are synchronous.
Message queues are asynchronous.
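A rough Java sketch of the two calling styles; the URL is a placeholder, and the in-process `BlockingQueue` only stands in for a real broker such as MSMQ or Azure Service Bus:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class RpcVsQueue {
    public static void main(String[] args) throws Exception {
        // RPC / request-response: the caller blocks until the response arrives (or times out).
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://example.com/orders/42")).build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());  // synchronous call
        System.out.println("RPC result: " + response.statusCode());

        // Message queue: the sender enqueues and moves on; a consumer processes later.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(); // stand-in for a real broker
        queue.put("{\"orderId\": 42}");   // fire-and-forget from the sender's point of view
        // ... some other process, possibly much later:
        String message = queue.take();
        System.out.println("Queued message processed: " + message);
    }
}
```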
Services that communicate with the client or with each other using a set of industry-accepted standards and protocols.
A service is a collection of operations (units of responsibility).
The service is the point of entry for the client.
Services are secure. (They handle any required internal security. They also inherently limit client access to their system.)
Service operations always leave the system in a consistent (valid) state.
Services handle faults/exceptions gracefully.
The client is not exposed to any details of how an operation is fulfilled.
The client is protected from code volatility (likelihood of change).
Service calls are (almost always) stateless.
A service is a logical representation of a repeatable business activity with a specific outcome. It is self-contained. It is a black box to programs that consume it.
A service can be composed of several other services.
An application whose volatile areas are abstracted, or wrapped, in a simple service call that is exposed to the client.
Each service call is called an "operation".
A set of actions that must all succeed together or all fail together.
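A minimal JDBC sketch of that all-or-nothing behavior; the connection URL, table, and column names are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class TransferFunds {
    // Debit one account and credit another: both statements succeed or neither does.
    static void transfer(String url, long from, long to, long cents) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url)) {
            conn.setAutoCommit(false);                     // start a transaction
            try (var debit  = conn.prepareStatement("UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 var credit = conn.prepareStatement("UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setLong(1, cents);  debit.setLong(2, from);  debit.executeUpdate();
                credit.setLong(1, cents); credit.setLong(2, to);   credit.executeUpdate();
                conn.commit();                             // both changes become visible together
            } catch (SQLException e) {
                conn.rollback();                           // neither change is applied
                throw e;
            }
        }
    }
}
```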
Services interact by sending messages across boundaries. The boundaries are formal and explicit. No assumption is made about what is behind the boundary.
Ex: the client does not know what kind of database or file system the service is using.
A service does not care how a message was created, or what will happen in the client after the service performs its actions.
Therefore, version and deploy the service independently from the client.
Therefore, design service contracts assuming that once published, they cannot change.
Only messages pass between services, not code. The messages are not arbitrary; they have been agreed upon.
A service must express what it does and how a client should communicate with it in a standard representation.
Services adhere to an explicit service description.
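A sketch of such an agreed-upon message as an explicit contract in Java; the fields are illustrative:

```java
// The only thing that crosses the service boundary is this agreed-upon message,
// not code or objects. Once published, change it only additively (new optional fields).
public record PlaceOrderV1(
        String orderId,      // assigned by the caller so retries can be detected
        String customerId,
        long totalCents,
        String currency) {
}
```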
Services minimize dependence on each other.
Services hide the logic they encapsulate.
Divide business logic into several services with the intent of maximizing reuse.
Services have control over the logic they encapsulate.
Services should be stateless.
The client should not be required to know that operations A, B, and C must be called in a specific order, or that operation D must be followed by operation E.
Such specific ordering should be contained in a single operation.
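A sketch of the idea; the fine-grained steps and the `CheckoutService` operation are hypothetical:

```java
// What the client should NOT have to know: the required order of fine-grained calls.
//   inventory.reserve(order);
//   payments.charge(order);   // must come after the reservation
//   shipping.ship(order);     // must come last

// What the service should expose instead: one operation that contains the ordering.
interface CheckoutService {
    Confirmation placeOrder(OrderRequest request);  // reserves, charges, and ships internally
}
record OrderRequest(String orderId, long totalCents) {}
record Confirmation(String trackingNumber) {}
```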
Services can be discovered, usually via a service registry; i.e., a client can invoke a service regardless of its actual location on the network.
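A minimal sketch of discovery with a hypothetical in-memory `ServiceRegistry` (a real system would use a product such as Consul or Eureka):

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;

// Clients look services up by logical name, so the physical address
// can change without any client changing.
class ServiceRegistry {
    private final Map<String, URI> entries = new java.util.concurrent.ConcurrentHashMap<>();

    void register(String serviceName, URI address) { entries.put(serviceName, address); }
    Optional<URI> resolve(String serviceName)      { return Optional.ofNullable(entries.get(serviceName)); }
}

class RegistryDemo {
    public static void main(String[] args) {
        ServiceRegistry registry = new ServiceRegistry();
        registry.register("orders", URI.create("http://10.0.3.17:8080")); // done at deployment time
        URI orders = registry.resolve("orders").orElseThrow();            // done by the client at call time
        System.out.println("Calling orders service at " + orders);
    }
}
```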
Services break big problems into little problems.
Services use standards that allow diverse clients to use the service.
[Advanced Distributed System Design (online course)]
The network is reliable.
The network is secure.
The network is homogenous.
The topology won't change.
Latency isn't a problem.
Bandwidth isn't a problem.
Transport cost isn't a problem.
The administrator will know what to do.
(from Deutsch and Gosling, in the 1990s)
Everything from the 8 Fallacies of Distributed Computing, plus
The system is atomic and/or monolithic.
The system is finished.
Business logic can and should be centralized.
(from Neward)
(The name is a joke about counting from 0)
How do you handle HttpTimeoutException?
- is the process running correctly, but it's just taking a while?
- did the request fail?
You can log the error and continue.
You can forward the error to the user.
You can retry (see the sketch below).
- but what if the request actually succeeded already?
- but what about transactional integrity?
You can use a reliable messaging service. (MSMQ, Azure Service Bus, WebSphere, etc.)
- message queues are designed to handle retry, store-and-forward, and transactional-integrity behaviors
- what you lose is request/response - everything will be asynchronous
System design is very different when you cannot rely on request/response interactions.
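A Java sketch of the retry option above using `java.net.http`; the endpoint is a placeholder, and the retry is only safe if the server treats the operation idempotently:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class TimeoutRetry {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://example.com/orders"))
                .timeout(Duration.ofSeconds(2))          // after this, HttpTimeoutException is thrown
                .POST(HttpRequest.BodyPublishers.ofString("{\"orderId\": 42}"))
                .build();

        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println("Succeeded with status " + response.statusCode());
                return;
            } catch (HttpTimeoutException e) {
                // Ambiguous: the server may still be processing, or may already have succeeded.
                // Blindly retrying can create a duplicate order unless the server deduplicates
                // (e.g., by orderId), which is exactly why reliable messaging is attractive.
                System.out.println("Attempt " + attempt + " timed out; retrying...");
            }
        }
        System.out.println("Giving up; forward the error to the user or a queue.");
    }
}
```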
Roundtrip time: time from invocation to response being received
- includes serialization, server processing, and deserialization
Latency time: time to cross the network in one direction
- more useful metric for distributed systems
Normalized latencies, based on 1 CPU cycle = 1 second
- Main memory access = 6 minutes
- Hard disk access = 2-6 days
- Internet from San Francisco to New York = 4 years
- Internet from San Francisco to England = 8 years
- Internet from San Francisco to Australia = 19 years
- TCP packet retransmit = 105-317 years
- OS virtualization system reboot = 423 years
- SCSI command timeout = 3 millennia
- Physical system reboot = 32 millennia
Distributed systems can have terrible latency issues.
If you use polymorphism and dependency injection, it may not be clear when you are even making a remote call.
If you use cloud services, it may not be clear how often a system will have to reboot under you.
If you use microservices, the many remote calls between them will be slower than in-process calls within a single service.
If you use an ORM with lazy-loading, it may not be clear how many database queries are being run (see the sketch below).
In general, avoid using remote objects.
- ORM lazy loading is an example of remote objects
- tightly coupled request/response interactions are examples of remote objects
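A hedged sketch of the lazy-loading issue using JPA-style annotations (assuming a provider such as Hibernate); the entities are illustrative:

```java
import jakarta.persistence.Entity;
import jakarta.persistence.FetchType;
import jakarta.persistence.Id;
import jakarta.persistence.OneToMany;
import java.util.List;

@Entity
class Customer {
    @Id Long id;
    String name;

    // LAZY: purchases are NOT loaded with the customer; they are fetched on first access.
    @OneToMany(fetch = FetchType.LAZY)
    List<Purchase> purchases;
}

@Entity
class Purchase {
    @Id Long id;
    long totalCents;
}

// Somewhere in application code:
//   for (Customer c : customers) {
//       total += c.purchases.size();   // each access here can trigger a separate SQL query
//   }
// Looping over 1,000 customers can issue 1,000 extra queries (the "N+1" problem),
// and nothing at the call site hints that database round trips are happening.
```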
Bandwidth capacity has grown more slowly than CPU speed or memory.
How much is a gigabit of bandwidth?
- that's about 128 megabytes
- TCP overhead will use up half of it - down to 60 megabytes
- take another half for the structure of whatever data format you use - down to 30 megabytes
Putting more servers on the same network will not increase your bandwidth.
To avoid wasting bandwidth, you want to load as little data as possible.
But that conflicts with latency concerns, which recommend you load as much data at once as possible.
The best you can do is load exactly the data you need at one time.
You may need to decompose your domain model into multiple smaller, more specific models, each one designed to handle a subset of scenarios.
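A sketch of that decomposition in Java; the models and fields are illustrative:

```java
// One big shared model: every consumer pays the bandwidth cost of every field.
record CustomerFull(String id, String name, String email, String shippingAddress,
                    String billingAddress, String phone, String marketingPreferences) {}

// Smaller, scenario-specific models: each screen or service loads exactly what it needs.
record CustomerListItem(String id, String name) {}                              // for a search results grid
record CustomerShippingView(String id, String name, String shippingAddress) {}  // for a dispatch service
```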
You can move time-critical data to its own network.
- either truly a separate network
- or subdivide your bandwidth and allocate a portion of it to each service
- this is a good reason to divide services into smaller services - based on their bandwidth priority
Nothing is secure. Even computers that are disconnected from the network are at risk every time you put in a disk or thumb drive. Every person with access to the system is a security risk. Faster computers mean it is easier to crack encryption.
With cloud computing, the network topology can change on the fly.
With callback contracts (and similar), the server remembers the address of the client so it can push updates to it. What if the client isn't there anymore? HTTP timeouts default to 30 seconds, so it will take that long to realize the client isn't available. This could even be exploited as a denial-of-service attack.
Can the system be designed so that it preserves its performance when the network topology changes?
So:
Don't hardcode IP addresses, domain names, etc.
Consider multicasting (talking to a range of addresses). This can be insecure.
Consider discovery mechanisms for self-configuring systems. It works well until it doesn't work.
Performance test early, starting at 30% of the way through a project. Test with various servers going down or rebooting.
Even if an admin actually knows the whole system top to bottom, you can't rely on them working for you forever. How long will it take their replacement to get up to speed? To become an expert?
Configuration is magic if you don't understand it, and don't expect its effects.
Some configuration may be in a text file. Some may be in the database. Some may be in environment variables.
So:
Use a configuration management system. See the ITIL standard.
Keep up-to-date documentation.
Save stable deployment points.
Backup the system.
IT will often get pushback from the business about when they can and cannot deploy updates, because the business knows that errors are likely to be introduced. This results in large deployments (which are more likely to contain errors) happening less frequently.
It is better to move toward continuous deployment. Deploy many small updates so often that deployment stops being worrying. Errors are easier to track down because the set of changes that could have caused them is smaller.
If you can make sure the new code is backward compatible, then you can have little or no downtime while you upgrade, because you can upgrade one server at a time. When you test, test multiple versions of the system running in parallel.
Queueing can help with downtime, because while one part of the system is down, another part can still be adding messages to the queue. Or one part can be consuming from the queue while the producer is down.
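A rough in-process sketch of that buffering, with `BlockingQueue` standing in for a durable broker such as MSMQ or Azure Service Bus:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueBuffersDowntime {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a durable queue (MSMQ, Azure Service Bus, etc.).
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // The consumer is "down": nothing is reading. The producer keeps working anyway.
        for (int i = 1; i <= 5; i++) {
            queue.put("order-" + i);                 // accepted even though no consumer is running
        }
        System.out.println("Producer finished; " + queue.size() + " messages waiting.");

        // Later, the consumer comes back up and drains the backlog.
        while (!queue.isEmpty()) {
            System.out.println("Processing " + queue.take());
        }
    }
}
```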
ESB stands for Enterprise Service Bus.
A service bus implements a communication system between mutually interacting software.
Instead of N services each communicating directly with the other (N-1) services, all services communicate only with the service bus, resulting in just N lines of communication. The service bus passes messages through to the target service.
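A minimal in-process sketch of the hub idea; the `Bus` class is hypothetical and only stands in for a real ESB product:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Point-to-point: N services each calling the other (N-1) directly means many
// connections to configure, secure, and monitor. With a bus, each service
// holds exactly one connection: to the bus.
class Bus {
    private final Map<String, List<Consumer<String>>> handlers = new ConcurrentHashMap<>();

    void subscribe(String messageType, Consumer<String> handler) {
        handlers.computeIfAbsent(messageType, t -> new CopyOnWriteArrayList<>()).add(handler);
    }

    void publish(String messageType, String payload) {
        handlers.getOrDefault(messageType, List.of()).forEach(h -> h.accept(payload));
    }
}

class BusDemo {
    public static void main(String[] args) {
        Bus bus = new Bus();
        bus.subscribe("OrderPlaced", msg -> System.out.println("Billing saw: " + msg));
        bus.subscribe("OrderPlaced", msg -> System.out.println("Shipping saw: " + msg));
        bus.publish("OrderPlaced", "{\"orderId\": 42}");  // the sender only knows the bus
    }
}
```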