Kubernetes Migration Case Study

Posted on January 26, 2021 by itibble@gmail.com

Migrating Netdelta From Docker to Kubernetes

In latish 2020, I moved Netdelta from a Docker deployment to Kubernetes, partly to see what all this Kubernetes jazz is about, and partly to investigate whether it would help me with the management of Netdelta containers for different punters, each of whom has their own docker container and Apache listening service.

I studiously went through the Kubernetes quick tutorial and found i had to investigate the documentation some more. Even then some aspects weren’t covered so well. This post explains what i did to deploy an app into Kubernetes, and some of the gotchas i encountered along the way, that were not covered so well in the Kubernetes documentation, and I summarise with a view of Kubernetes and give my view on: is the hype justified? Will I continue to host Netdelta in Kubernetes?

This is not a Kubernetes tutorial – it does assume some prior exposure on behalf of the reader, but nonethless links to the relevant documentation when some Kubernetes concepts are covered.

Netdelta in Docker
Data Flows / Networking
Reverse Proxy
K8s Nginx Ingress Controller
DNS
The Application Hosting for Kubernetes
- Kubernetes Equivalent of Docker Entrypoint Script Parameters
- Private Image Repository
PV Mounts
Database and Fileserver
To K8s Or Not To K8s?

Netdelta in Docker

This post isn’t about Netdelta, but for illustrative purposes: Netdelta aids with the detection of unauthorised changes, and hacker shells, by running one-off port scans, or scheduled jobs, comparing the results with the previous scan, and alerting on changes. This is more chunky than it sounds, mostly because of the analytics that goes into false positives detection. In the Kubernetes implementation, scan results are held in a stateful persistent volume with MySQL.

Netdelta’s docker config can be dug into here, but to summarise the docker setup:

Database container – MySQL 5.7
Application container – Apache, Django 3.1.4, Celery 5.0.5, Netdelta
Fileserver (logs, virtualenv, code deployment)
Docker volumes and networking are utilised

Data Flows / Networking

The data flows aspect reflects what is not exactly a bare metal deployment. A Linode-hosted VM running Ubuntu 20 is the host, then the Kubernetes node is minikube, with another node running on a Raspberry pi 3 – the latter aspect not being a production facility. The pi 3 was only to test how well the config would work with load balancing, and Kubernetes Replicasets across nodes.

Reverse Proxy

Ingress connections from the internet are handled first by nginx acting as a reverse proxy. Base URLs for Netdelta are of the form https://www.netdelta.io/<site>. The nginx config …

server {
    listen 80;
    location /barbican {
	proxy_set_header Accept-Encoding "";
	sub_filter_types text/html text/css text/xml;
	sub_filter $host $host/barbican;
        proxy_pass http://local.netdelta.io/barbican;
    }
}

K8s Ingress Controller

This is passing a URL with a first level of <site> to be processed at local.netdelta.io, which is locally resolvable, and is localhost. This is where the nginx Kubernetes Ingress Controller comes into play. The pods in kubernetes have NodePorts configured but these aren’t necessary. The nginx ingress controller takes connections on port 80, and routes based on service names and the defined listening port:

┌──(iantibble㉿bionic)-[~]
└─$ kubectl describe ingress
Name:             netdelta-ingress
Namespace:        default
Address:          172.17.0.2
Default backend:  default-http-backend:80 (<error: endpoints "default-http-backend" not found>)
Rules:
  Host               Path  Backends
  ----               ----  --------
  local.netdelta.io
                     /barbican   netdelta-barbican:9004 (<none>)
Annotations:         <none>
Events:              <none>

The YAML looks thusly:

┌──(iantibble㉿bionic)-[~/netdd/k8s]
└─$ cat ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: netdelta-ingress
spec:
  rules:
    - host: local.netdelta.io
      http:
        paths:
          - path: /barbican
            backend:
              service:
                name: netdelta-barbican
                port:
                  number: 9004
            pathType: Prefix

So the nginx ingress controllers sees the connection forwarded from local.netdelta.io with a URL request of local.netdelta.io/<site>. The requests matches a rule, and forwards to the Kubernetes Service of the same name. The entity that actually answers the call is a docker container masquerading as a Kubernetes Pod, which is part of a deployment. The next step in the data flow is to route the connection to the specified Kubernetes Service which is covered briefly here but in more detail later in the coverage of DNS.

The “service” aspect has the effect of exposing the pod according to the service setup:

┌──(iantibble㉿bionic)-[~/netdd/k8s]
└─$ kubectl get services -o wide
NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE    SELECTOR
kubernetes          ClusterIP   10.96.0.1                443/TCP          119d   
mysql-netdelta      ClusterIP   10.97.140.111            3306/TCP         39d    app=mysql-netdelta
netdelta-barbican   NodePort    10.103.160.223           9004:30460/TCP   36d    app=netdelta-barbican
netdelta-xynexis    NodePort    10.102.53.156            9005:31259/TCP   36d    app

DNS

There’s an awful lot of waffle out there about DNS and Kubernetes. Basically – and I know the god of devops won’t let me in heaven for saying this, but making a service in Kubernetes leads to DNS being enabled. DNS in a multi-namespace, multi-node scenario becomes more intreresting of course, and there’s plenty you can configure that’s outside the scope of this article.

Netdelta’s Django settings.py defines a host and database name, and has to be able to find the host:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql', # Add 'postgresql_psycopg2', 'mysql', 'sqlite3' or 'oracle'.
        'NAME': 'netdelta-SITENAME',                  # Not used with sqlite3.
        'USER': 'root', # Not used with sqlite3.
        'HOST': mysql-netdelta,
        'PASSWORD': 'NOYFB',
        'OPTIONS': dict(init_command="SET sql_mode='STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER'"),
    }
}

This aspect was poorly documented and was far from obvious: the spec.selector field of the service should match the spec.template.metadata.labels of the pod created by the Deployment.

The Application Hosting in Kubernetes

Referring back to the diagram above, there are pods for each Netdelta site. How was the Docker-hosted version of Netdelta represented in Kubernetes?

The Deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: netdelta-barbican
  name: netdelta-barbican
spec:
  replicas: 1
  selector:
    matchLabels:
      app: netdelta-barbican
  strategy:
    type: Recreate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: netdelta-barbican
    spec:
      containers:
      - image: registry.netdelta.io/netdelta/barbican:1.0
        imagePullPolicy: IfNotPresent
        name: netdelta-barbican
        ports:
        - containerPort: 9004
        args:
          - "barbican"
          - "9004"
          - "le"
          - "certs"
        resources: {}
        volumeMounts:
          - mountPath: /srv/staging
            name: netdelta-app
          - mountPath: /srv/logs
            name: netdelta-logs
          - mountPath: /le
            name: le
          - mountPath: /var/lib/mysql
            name: data
          - mountPath: /srv/netdelta_venv
            name: netdelta-venv
      imagePullSecrets:
        - name: regcred
      volumes:
        - name: netdelta-app
          persistentVolumeClaim:
            claimName: netdelta-app
        - name: netdelta-logs
          persistentVolumeClaim:
            claimName: netdelta-logs
        - name: le
          persistentVolumeClaim:
            claimName: le
        - name: data
          persistentVolumeClaim:
            claimName: data
        - name: netdelta-venv
          persistentVolumeClaim:
            claimName: netdelta-venv
      restartPolicy: Always
      serviceAccountName: ""
status: {}

Running:

kubectl apply -f netdelta-app-<site>.yaml

Has the effect of creating a pod and a container for the Django application, celery and Apache stack:

┌──(iantibble㉿bionic)-[~]
└─$ kubectl get deployments
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
fileserver          1/1     1            1           25d
mysql-netdelta      1/1     1            1           25d
netdelta-barbican   1/1     1            1           25d

┌──(iantibble㉿bionic)-[~]
└─$ kubectl get pods
NAME                                 READY   STATUS    RESTARTS   AGE
fileserver-6d6bc54f6c-hq8lk          1/1     Running   2          25d
mysql-netdelta-5fd7757c66-xqp2j      1/1     Running   2          25d
netdelta-barbican-68d78c58bd-vnqdn   1/1     Running   2          25d

K8s Equivakent of Docker Entrypoint Script Parameters

Some other points perhaps worthy of mention were around the Docker v Kubernetes aspects. My docker run command for the netdelta application container was like this:

docker run -it -p 9004:9004 --network netdelta_net --name netdelta_barbican -v netdelta_app:/srv/staging -v netdelta_logs:/srv/logs -v data:/data -v le:/etc/letsencrypt netdelta/barbican:core barbican 9004 le certs

So there’s 4 parameters for the entryscript: site, port, le, and cert. The last two are about letsencrypt certs which won’t be covered here. These are represented in the Kubernetes Deployment YAML in spec.template.spec.containers.args.

Private Image Repository

spec.template.spec.containers.image is set to registry.netdelta.io/netdelta/<site>:<version tag>. Yes, that’s right folks, i’m using a private registry, which is a lot of fun until you realise how hard it is to manage the images there. The setup and management of the private registry won’t be covered here but i found this to be useful.

One other point is about security and encryption in transit for the image pushes and pulls. I’ve been in security for 20 years and have lots of unrestricted penetration testing experience. It shouldn’t be necessary or mandatory to use HTTPS over HTTP in most cases. Admittedly i didn’t spend long trying, but i could not find a way to just use good old clear-text port 80 over 443, which in turn meant i had to configure a SSL certifcate with all the management around it, where the risks are far from justifying such a measure.

PV Mounts

In Dockerland I was using Docker Volumes for persistent storage of logs and application data. I was also using it for the application codebase, and any updates would be sync’d with containers by docker exec wrapped in a BASH script.

There was nothing unexpected in the deployment of the PVCs/PVs, but a couple of points are worth mentioning:

PV Filesystem mounts: Netdelta container deployment involves a custom image from COPY (Docker command) of files from a local source to the image. Then the container is run and the application can find the required files. The problem i ran into was about having filesystems mounted over the directories where my application container expected to find files. This meant i had to change my container entryscript to sync with the image when the Pod is deployed, whereas previously the directories were built-out from the docker image build.
/tmp as default PV files location: if you SSH to the node (minikube container in my case), you will find the mounted filesystems under /tmp. /tmp is a critical directory for the good health of any Linux-based system and it needs to be 777 (i.e. read and writeable by unauthenticated users and processes) with a sticky bit. This is one that for whatever reason doesn’t find its way into security checklists for Kubernetes but it really does warrant some attention. This can be changed by customising Kubernetes Storage Classes. There’s one pointer here.

Database and Fileserver

The MySQL Database service was deployed as a custom built container with my Docker setup. There was no special reason for this other than to change filesystem permissions, and the fact that the listening service needed to be “exposed” and the database config changed to bind to 0.0.0.0 instead of localhost. What i found with the Kubernetes Pod was that I didn’t need to change the Mysql config at all and spec.ports.targetport had the effect of “exposing” the listening service for the database.

The main reason for using a fileserver in the Dcoker deployment of Netdelta was to act as a container buffer between Docker Volumes and application containers. My my Unix hat on, one is left wondering how filesystem persmissions will work (or otherwise) with file read and writes across network mounted disparate unix systems, where even if the same account names exist on each system, perhaps they have different UIDs (BSD-derived systems use the UID to define ownership, not the name on the account). Moreover it was advised as a best practice measure in the Docker documentation to use an intermediate fileserver. Accordingly this was the way i decided to go with Kubernetes, with a “sidecar” Pod as a fileserver, which mounts the PVs onto the required mount points.

To K8s Or Not To K8s?

When you think about the way that e.g. Minikube is deployed – its a docker container. If you run a docker ps -a, you can see all the mechanics at work. And then if you SSH to the minikube, you can do another docker ps -a, and you see everything to do with Kubernetes pods and containers in the output. This seems like a mess, and if it isn’t, it will do until the mess actually arrives.

Furthermore, you don’t even want to look at the routing tables or network interfaces on the node host. You just cannot unsee that.

There is some considerable complexity here. Further, when you read the documentation for Kubernetes, it does have all the air of documentation written by programmers. We hear a lot about the lack of IT-skilled people, but what is even more lacking, are strategic thinkers (e.g. * [wildcard] Architects) who translate top level business design requirements into programming tactical requirements.

Knowing how Kubernetes works should be enough to know whether it’s really going to be beneficial or not to host your containers there. If you’re not sure you need it, then you probably don’t. In the case of Netdelta, if i have lots and lots of Netdelta sites to manage then i can go with Kubernetes, and now that i have seen Netdelta happily running in Kubernetes with both scheduled celery jobs and manual user-initiated scans, the transition will be a smooth one. In the meantime, I can work with Docker containers alone, with the supporting BASH scripts, whuch are here if you’re interested.

Fintechs and Security – Part 4

Posted on April 12, 2020 by itibble@gmail.com

Prologue – covers the overall challenge at a high level
Part One – Recruiting and Interviews
Part Two – Threat and Vulnerability Management – Application Security
Part Three – Threat and Vulnerability Management – Other Layers
Part Four – Logging
Part Five – Cryptography and Key Management, and Identity Management
Part Six – Trust (network controls, such as firewalls and proxies), and Resilience

Logging

Notice “Logging” is used here, not “SIEM”. With use of “SIEM”, there is often a mental leap, or stumble, towards a commercial solution. But there doesn’t necessarily need to be a commercial solution. This post invites the reader to take a step back from the precipice of engaging with vendors, and check first if that journey is one you want to make.

Unfortunately, in 2020, it is still the case that many fintechs are doing one of two things:

Procuring a commercial solution without thinking about what is going to be logged, or thinking about the actual business goals that a logging solution is intended to achieve.
Just going with the Cloud Service Provider’s (CSP) SaaS offering – e.g. Stackdriver (now called “Operations”) for Google Cloud, or Security Center for Azure.

Design Process

The process HLD takes into risks from threat modelling (and maybe other sources), and another input from compliance requirements (maybe security standards and legal requirements), and uses the requirements from the HLD to drive the LLD. The LLD will call out the use cases and volume requirements that satisfy the HLD requirements – but importantly, it does not cover the technological solution. That comes later.

The diagram above calls out Splunk but of course it doesn’t have to be Splunk.

Security Operations

The end goal of the design process is heavily weighted towards a security operations or protective monitoring capability. Alerts will be specified which will then be configured into the technological solution (if it supports this). Run-books are developed based on on-going continuous improvement – this “tuning” is based on adjusting to false positives mainly, and adding further alerts, or modifying existing alerts.

The decision making on how to respond to alerts requires intimate knowledge of networks and applications, trust relationships, data flows, and the business criticality of information assets. This is not a role for fresh graduates. Risk assessment drives the response to an alert, and the decision on whether or not to engage an incident response process.

General IT monitoring can form the first level response, and then Security Operations consumes events from this first level that are related to potential security incidents.

Two main points relating this SecOps function:

Outsourcing doesn’t typically work when it comes to the 2nd level. Outsourcing of the first level is more likely to be cost effective. Dr Anton Chuvakin’s post on what can, and cannot be outsourced in security is the most well-rounded and realistic that i’ve seen. Generally anything that requires in-house knowledge and intimacy of how events relate to business risks – this cannot be outsourced at all effectively.
The maturity of SecOps doesn’t happen overnight. Expect it to take more than 12 months for a larger fintech with a complex cloud footprint.

The logging capability is the bedrock of SecOps, and how it relates to other security capabilities can be simplified as in the diagram below. The boxes on the left are self-explanatory with the possible exception of Active Trust Management – this is heavily network-oriented and at the engineering end of the rainbow, its about firewalls, reverse and forward proxies mainly:

Custom Use Cases

For the vast majority of cases, custom use cases will need to be formulated. This involves building a picture of “normal”, so as to enable alerting on abnormal. So taking the networking example: what are my data flows? Take my most critical applications – what are source and destination IP addresses, and what is the port on the server-side of the client-server relationship? So then a possible custom use case could be: raise an alert when a connection is aimed at the server from anywhere other than the client(s).

Generic use cases are no-brainers. Examples are brute force attempts and technology or user behaviour-specific use cases. Some good examples are here. Custom use cases requires an understanding of how applications, networks, and operating systems are knitted together. But both custom and generic use cases require a log source to be called out. For network events, this will be a firewall as the best candidate. It generally makes very little sense to deploy network IDS nodes in cloud.

So for each application, generate a table of custom use cases, and identify a log source for each. Generic use cases are those configured auto-tragically in Splunk Enterprise Security for example. But even Splunk cannot magically give you custom use cases, or even ensure that all devices are included in the coverage for generic use cases. No – humans still have a monopoly over custom use cases and well, really, most of SIEM configuration. AI and Cyberdyne Systems won’t be able to get near custom use cases in our lifetimes, or ever, other than the fantasy world of vendor Powerpoint slides.

Don’t forget to test custom use case alerting. So for network events, spin up a VM in a centrally trusted area, like a management Vnet/VPC for example. Port scan from there to see if alerts are triggered. Netcat can be very useful here too, for spoofing source addresses for example.

Correlation

Correlation was the phrase used by vendors in the heady days of the 00s. The premise was something like this: event A, event B, and event C. Taken in isolation (topical), each seem innocuous. But bake them together and you have a clear indicator that skullduggery is afoot.

I suggest you park correlation in the early stage of a logging capability deployment. Maybe consider it for down the road, once a decent level of maturity has been reached in SecOps, and consider also that any attempt to try and get too clever can result in your SIEM frying circuit boards. The aim initially should be to reduce complexity as much as possible, and nothing is better at adding complexity than correlation. Really – basic alerting on generic and custom use cases gives you most of the coverage you need for now, and in any case, you can’t expect to get anywhere near an ideal state with logging.

SaaS

Operating system logs are important in many cases. When you decide to SaaS a solution, note that you lose control over operating system events. You cannot turn off events that you’re not interested in (e.g. Windows Object auditing events which have had a few too many pizzas). This can be a problem if you decide to go with a COTS where licensing costs are based on volume of events. Also, you cannot turn on OS events that you could be interested in. The way CSPs play here is to assume everything is interesting, which can get expensive. Very expensive.

Note – its also, in most cases, not such a great idea to use a SaaS based SIEM. Why? Because this function has connectivity with everything. It has trust relationships with dev/test, pre-prod, and production. You really want full control over this platform (i.e. be able to login with admin credentials and take control of the OS), especially as it hosts lots of information that would be very interesting for attackers, and is potentially the main target for attackers, because of the trust relationships I mentioned before.

So with SaaS, its probably not the case that you are missing critical events. You just get flooded. The same applies to 3rd party applications, but for custom, in-house developed applications, you still have control of course of the application layer.

Custom, In-house Developed Applications

You have your debugging stream and you have your application stream. You can assign critical levels to events in your code (these are the classic syslog severity levels). The application events stream is critical. From an application security perspective, many events are not immediately intuitively of interest, but by using knowledge of how hackers work in practice, security can offer some surprises here, pleasant or otherwise.

If you’re a developer, you can ease the strain on your infosec colleagues by using consistent JSON logging keys across the board. For example, don’t start with ‘userid’ and then flip to ‘user_id’ later, because it makes the configuration of alerting more of a challenge than it needs to be. To some extent, this is unavoidable, because different vendors use different keys, but every bit helps. Note also that if search patterns for alerting have to cater for multiple different keys in JSON documents, the load on the SIEM will be unnecessarily high.

It goes without saying also: think about where your application and debug logs are being transmitted and stored. These are a source of extremely valuable intelligence for an attacker.

The Technology

The technological side of the logging capability isn’t the biggest side. The technology is there to fulfil a logging requirement, it is not in itself the logging capability. There are also people and processes around logging, but its worth talking about the technology.

What’s more common than many would think – organisation acquires a COTS SIEM tool but the security engineers hate it. Its slow and doesn’t do much of any use. So they find their own way of aggregating network-centralised events with a syslog bucket of some description. Performance is very often the reason why engineers will be grep’ing over syslog text files.

Whereas the aforementioned sounds ineffective, sadly its more effective than botched SIEM deployments with poorly designed tech. It also ticks the “network centralised logging” box for auditors.

The open-source tools solution can work for lots of organisations, but what you don’t get so easily is the real-time alerting. The main cost will be storage. No license fees. Just take a step back, and think what it is you really want to achieve in logging (see the design process above). The features of the open source logging solution can be something like this:

Rsyslog is TCP and covers authentication of hosts. Rsyslog is a popular protocol because it enables TCP layer transmission from most log source types (one exception is some Cisco network devices and firewalls), and also encryption of data in transit, which is strongly recommended in a wide open, “flat” network architecture where eavesdropping is a prevalent risk.
Even Windows can “speak” rsyslog with the aid of a local agent such as nxlog.
There are plenty of Host-based Intrusion Detection System (HIDS) agents for Linux and Windows – OSSEC, Suricata, etc.
Intermediate network logging Rsyslog servers can aggregate logs for network zones/subnets. There are the equivalent of Splunk forwarders or Alienvault Sensors. A cron job runs an rsync over Secure Shell (SSH), which uploads the batches of events data periodically to a Syslog Lake, for want of a better phrase.
The folder structure on the Syslog server can reflect dates – years, months, days – and distinct files are named to indicate the log source or intermediate server.
Good open source logging tools are getting harder to find. Once a tool gets a reputation, it aint’ free any mo. There are still some things you can do with ELK for free (but not alerting). Graylog is widely touted. At the time of writing you can still log e.g. 100 GB/day, and you don’t pay if you forego support or any of the other Enterprise features.

Splunk

Splunk sales people have a dart board with my picture on it. To be fair, the official Splunk line is that they want to help their customers save events indexing money because it benefits them in the longer term. And they’re right, this does work for Splunk and their customers. But many of the resellers are either lacking the skills to help, or they are just interested in a quick and dirty install. “Live for today, don’t worry about tomorrow”.

Splunk really is a Lamborghini, and the few times when i’ve been involved in bidding beauty parades for SIEM, Splunk often comes out cheaper believe it or not. Splunk was made for logging and was engineered as such. Some of the other SIEM engines are poorly coded and connect to a MySQL database for example, whereas Splunk has its own database effectively. The difference in performance is extraordinary. A Splunk search involving a complex regex with busy indexers and search heads takes a fraction of the time to complete, compared with a similar scenario from other tools on the same hardware.

Three main ways to reduce events indexing costs with Splunk:

Root out useless events. Windows is the main culprit here, in particular Auditing of Objects. Do you need, for example, all that performance monitoring data? Debug events? Firewall AND NIDS events? Denied AND accepted packets from firewalls?
Develop your use cases (see above) and turn off all other logging. You can use filters to achieve this.
You can be highly selective about which events are forwarded to the Splunk indexer. One conceptual model just to illustrate the point is given below:

Threat Hunting

Threat Hunting is kind of the sexy offering for the world of defence. Offence has had more than its fair share of glamour offerings over the years. Now its defence’s turn. Or is it? I mean i get it. It’s a good thing to put on your profile, and in some cases there are dramatic lines such as “be the hunter or the hunted”.

However, a rational view of “hunting” is that it requires LOTS of resources and LOTS of skill – two commodities that are very scarce. Threat hunting in most cases is the worst kind of resources sink hole. If you take vulnerability management (TVM) and the kind of basic detection discussed thus far in this article, you have a defence capability that in most cases fits the risk management needs of the organisation. So then there’s two questions to ask:

How much does threat hunting offer on top of a suitably configured logging and TVM capability? Not much in the best of cases. Especially with credentialed scanning with TVM – there is very little of your attack surface that you cannot cover.
How much does threat hunting offer in isolation (i.e. threat hunting with no TVM or logging)? This is the worst case scenario that will end up getting us all fired in security. Don’t do it!!! Just don’t. You will be wide open to attack. This is similar to a TVM program that consists only of one-week penetration tests every 6 months.

Threat Intelligence (TI)

Ok so here’s a funny story. At a trading house client here in London around 2016: they were paying a large yellow vendor lots of fazools every month for “threat intelligence”. I couldn’t help but notice a similarity in the output displayed in the portal as compared with what i had seen from the client’s Alienvault. There is a good reason for this: it WAS Alienvault. The feeds were coming from switches and firewalls inside the client network, and clearly $VENDOR was using Alienvault also. So they were paying heaps to see a duplication of the data they already had in their own Alienvault.

The aforementioned is an extremely bad case of course. The worst of the worst. But can you expect more value from other threat intelligence feeds? Well…remember what i was saying about the value of an effective TVM and detection program? Ok I’ll summarise the two main problems with TI:

You can really achieve LOTS in defence with a good credentialed TVM program plus even a half-decent logging program. I speak as someone who has lots of experience in unrestricted penetration testing – believe me you are well covered with a good TVM and detection SecOps function. You don’t need to be looking at threats apart from a few caveats…see later.
TI from commercial feeds isn’t about your network. Its about the whole planet. Its like picking up a newspaper to find out what’s happening in the world, and seeing on the front cover that a butterfly in China has flapped its wings recently.

Where TI can be useful – macro developments and sector-specific developments. For example, a new approach to Phishing, or a new class of vulnerability with software that you host, or if you’re in the public sector and your friendly national spy agency has picked up on hostile intentions towards you. But i don’t want to know that a new malware payload has been doing the rounds. In the time taken to read the briefing, 2000 new payloads have been released to the wild.

Summary

Start out with a design process that takes input feeds from compliance and risk (perhaps threat modelling), use the resulting requirements to drive the LLD, which may or may not result in a decision to procure tech that meets the requirements of the LLD.
An effective logging capability can only be designed with intimate knowledge of the estate – databases, crown jewels, data flows – for each application. Without such knowledge, it isn’t possible to build even a barely useful logging capability. Call out your generic and custom use cases in your LLD, independent of technology.
Get your basic alerting first, correlation can come later, if ever.
Outsourcing is a waste of resources for second level SecOps.
With SaaS, your SIEM itself is dangerously exposed, and you have no control over what is logged from SaaS log sources.
You are not mandated to get a COTS. Think about what it is that you want to achieve. It could be that open source tools across the board work for you.
Splunk really is the Lamborghini of SIEMs and the “expensive” tag is unjustified. If you carefully design custom and generic use cases, and remove everything else from indexing, you suddenly don’t have such an expensive logger. You can also aggregate everything in a Syslog pool before it hits Splunk indexers, and be more selective about what gets forwarded.
I speak as someone with lots of experience in unrestricted penetration testing: Threat Hunting and Threat Intelligence aren’t worth the effort in most cases.

Fintechs and Security – Part Two

Posted on January 7, 2020 by itibble@gmail.com

Prologue – covers the overall challenge at a high level
Part One – Recruiting and Interviews
Part Two – Threat and Vulnerability Management – Application Security
Part Three – Threat and Vulnerability Management – Other Layers
Part Four – Logging
Part Five – Cryptography and Key Management, and Identity Management
Part Six – Trust (network controls, such as firewalls and proxies), and Resilience

Threat and Vulnerability Management (TVM) – Application Security

This part covers some high-level guider points related to the design of the application security side of TVM (Threat and Vulnerability Management), and the more common pitfalls that plague lots of organisations, not just fintechs. I won’t be covering different tools in the SAST or DAST space apart from one known-good. There are some decent SAST tools out there but none really stand out. The market is ever-changing. When i ask vendors to explain what they mean by [new acronym] what usually results is nothing, or a blast of obfuscation. So I’m not here to talk about specific vendor offerings, especially as the SAST challenge is so hard to get even close to right.

With vulnerability management in general, ${VENDOR} has succeeded in fouling the waters by claiming to be able to automate vulnerability management. This is nonsense. Vulnerability assessment can to some limited degree be automated with decent results, but vulnerability management cannot be automated.

The vulnerability management cycle has also been made more complicated by GRC folk who will present a diagram representing a cycle with 100 steps, when really its just assess –> deduce risk –> treat risk –> GOTO 1. The process is endless, and in the beginning it will be painful, but if handled without redundant theory, acronyms-for-the-sake-of-acronyms-for-the-same-concept-that-already-has-lots-of-acronyms, rebadging older concepts with a new name to make them seem revolutionary, or other common obfuscation techniques, it can be easily integrated as an operational process fairly quickly.

The Dawn Of Application Security

If you go back to the heady days of the late 90s, application security was a thing, it just wasn’t called “application security”. It was called penetration testing. Around the early 2000s, firewall configurations improved to the extent that in a pen test, you would only “see” port 80 and/or 443 exposing a web service on Apache, Internet Information Server, or iPlanet (those were the days – buffer overflow nirvana). So with other attack channels being closed from the perimeter perspective, more scrutiny was given to web-based services.

Attackers realised you can subvert user input by intercepting it with a proxy, modifying some fields, perhaps inject some SQL or HTML, and see output that perhaps you wouldn’t expect to see as part of the business goals of the online service.

At this point the “application security” world was formed and vulnerabilities were classified and given new names. The OWASP Top Ten was born, and the world has never been the same since.

SAST/DAST

More acronyms have been invented by ${VENDOR} since the early early pre-holocene days of appsec, supposedly representing “brand new” concepts such as SAST (Static Application Security Testing) and DAST (Dynamic Application Security Testing), which is the new equivalent of white box and black box testing respectively. The basic difference is about access to the source code. SAST is source code testing while DAST is an approach that will involve testing for OWASP type vulnerabilities while the software is running and accepting client connection requests.

The SAST scene is one that has been adopted by fintechs in more recent times. If you go back 15 years, you would struggle to find any real commercial interest in doing SAST – so if anyone ever tells you they have “20” or even “10” years of SAST experience, suggest they improve their creativity skills. The general feeling, not unjustified, was that for a large, complex application, assessing thousands of lines of source code at a vital organ/day couldn’t be justified.

SAST is more of a common requirement these days. Why is that? The rise of fintechs, whose business is solely about generation of applications, is one side of it, and fintechs can (and do) go bust if they suffer a breach. Also – ${VENDOR}s have responded to the changing Appsec landscape by offering “solutions”. To be fair, the offerings ARE better than 10 years ago, but it wouldn’t take much to better those Hello World scripts. No but seriously, SAST assessment tools are raved about by Gartner and other independent sources, and they ARE better than offerings from the Victorian era, but only in certain refined scenarios and with certain programming languages.

If it was possible to be able to comprehensively assess lots of source code for vulnerability and get accurate results, then theoretically DAST would be harder to justify as a business undertaking. But as it is, SAST + DAST, despite the extensive resources required to do this effectively, can be justified in some cases. In other cases it can be perfectly fine to just go with DAST. It’s unlikely ever going to be ok to just go with SAST because of the scale of the task with complex apps.

Another point here – i see some fintechs using more than one SAST tool, from different vendors. There’s usually not much to gain from this. Some tools are better with some programming languages than others, but there is nothing cast in stone or any kind of majority-view here. The costs of going with multiple vendors is likely going to be harder and harder to justify as time goes on.

Does Automated Vulnerability Assessment Help?

The problem of appsec is still too complex for decent returns from automation. Anyone who has ever done any manual testing for issues such as XSS knows the vast myriad of ways in which such issues can be manifested. The blackbox/blind/DAST scene is still not more than Burp, Dirbuster, but even then its mostly still manual testing with proxies. Don’t expect to cover all OWASP top 10 issues for a complex application that presents an admin plus a user interface, even in a two-week engagement with four analysts.

My preferred approach is still Fred Flinstone’y, but since the automation just isn’t there yet, maybe its the best approach? This needs to happen when an application is still in the conceptual white board architecture design phase, not a fully grown [insert Hipster-given-name], and it goes something like this: white board, application architect – zero in on the areas where data flows involve transactions with untrusted networks or users. Crpyto/key management is another area to zoom in on.

Web Application Firewall

The best thing about WAFs, is they only allow propagation of the most dangerous attacks. But seriously, WAF can help, and in some respects, given the above-mentioned challenges of automating code testing, you need all the help you can get, but you need to spend time teaching the WAF about your expected URL patterns and tuning it – this can be costly. A “dumb” default-configured WAF can probably catch drive-by type issues for public disclosed vulnerabilities as long as you keep it updated. A lot depends on your risk profile, but note that you don’t need a security engineer to install a WAF and leave it in default config. Pretty much anyone can do this. You _do_ need an experienced security engineer or two to properly understand an application and configure a WAF accordingly.

Python and Ruby – Web Application Frameworks

Web application frameworks such as Ruby on Rails (RoR) and Django are in common usage in fintechs, and are at least in some cases, developed with security in mind in that they do offer developers features that are on by default. For example, with Django, if you design a HTML form for user input, the server side will have been automagically configured with the validation on the server side, depending on the model field type. So an email address will be validated client and server-side as an email address. Most OWASP issues are the result of failures to validate user input on the server side.

Note also though that with Django you can still disable HTML tag filtering of user input with a “| safe” in the template. So it’s dangerous to assume that all user input is sanitised.

In Django Templates you will also see a CSRF token as a hidden form field if you include a Form object in your template.

The point here is – the root of all evil in appsec is server-side validation, and much of your server-side validation effort in development will be covered by default if you go with RoR or Django. That is not the end of the story though with appsec and Django/RoR apps. Vulnerability of the host OS and applications can be problematic, and it’s far from the case that use of either Django or RoR as a dev framework eliminates the need for DAST/SAST. However the effort will be significantly reduced as compared to the C/Java/PHP cases.

Wrap-up

Overall i don’t want to too take much time bleating about this topic because the take away is clear – you CAN take steps to address application security assessment automation and include the testing as part of your CI/CD pipeline, but don’t expect to catch all vulnerabilities or even half of what is likely an endless list.

Expect that you will be compromised and plan for it – this is cheaper than spending zillions (e.g. by going with multiple SAST tools as i’ve seen plenty of times) on solving an unsolvable problem – just don’t let an incident result in a costly breach. This is the purpose of security architecture and engineering. It’s more to deal with the consequences of an initial exploit of an application security fail, than to eliminate vulnerability.

Fintechs and Security – Part One

Posted on December 13, 2019 by itibble@gmail.com

Prologue – covers the overall challenge at a high level
Part One – Recruiting and Interviews
Part Two – Threat and Vulnerability Management – Application Security
Part Three – Threat and Vulnerability Management – Other Layers
Part Four – Logging
Part Five – Cryptography and Key Management, and Identity Management
Part Six – Trust (network controls, such as firewalls and proxies), and Resilience

Recruiting and Interviews

In the prologue of this four-stage process, I set the scene for what may come to pass in my attempt to relate my experiences with fintechs, based on what i am hearing on the street and what i’ve seen myself. In this next instalment, i look at how fintechs are approaching the hiring conundrum when it comes to hiring security specialists, and how, based on typical requirements, things could maybe be improved.

The most common fintech setup is one of public-cloud (AWS, Azure, GCP, etc), They’re developing, or have developed, software for deployment in cloud, with a mobile/web front end. They use devops tools to deploy code, manage and scale (e.g. Kubernetes), collaborate (Git variants) and manage infrastructure (Ansible, Terraform, etc), perhaps they do some SAST. Sometimes they even have different Virtual Private Clouds (VPCs) for different levels of code maturity, one for testing, and one for management. And third party connections with APIs are not uncommon.

Common Pitfalls

Fintechs adopt the stance: “we don’t need outside help because we have hipsters. They use acronyms and seem quite confident, and they’re telling me they can handle it”. While not impossible that this can work – its unlikely that a few devops peeps can give a fintech the help they need – this will become apparent later.
Using devops staff to interview security engineers. More on this problem later.
Testing security engineers with a list of pre-prepared questions. This is unlikely to not end in tears for the fintech. Security is too wide and deep an area for this approach. Fintechs will be rejecting a lot of good candidates by doing this. Just have a chat! For example, ask the candidate their opinions on the usefulness of VA scanners. The length of the response is as important as its technical accuracy. A long response gives an indication of passion for the field.
Getting on the security bandwagon too late (such as when you’re already in production!) you are looking at two choices – engage an experienced security hand and ignore their advice, or do not ignore their advice and face downtime, and massive disruption. Most will choose the first option and run the project at massive business risk.

The Security Challenge

Infosec is important, just as checking to see if cars are approaching before crossing the road is important. And the complexity of infosec mandates architecture. Civil engineering projects use architecture. There’s a good reason for that – which doesn’t need elaborating on.

Whenever you are trying to build something complex with lots of moving parts, architecture is used to reduce the problem down to a manageable size, and help to build good practices in risk management. The end goal is protective monitoring of an infrastructure that is built with requirements for meeting both risk and compliance challenges.

Because of the complexity of the challenge, it’s good to split the challenge into manageable parts. This doesn’t require talking endlessly about frameworks such as SABSA. But the following six capabilities (people, process, technology) approach is sleek and low-footprint enough for fintechs:

Threat and Vulnerability Management (TVM)
Logging – not “telemetry” or Threat intelligence, or threat hunting. Just logging. Not even necessarily SIEM.
Cryptography and Key Management
Identity Management
Business Continuity Management
Trust (network segmentation, firewalls, proxies).

I will cover these 6 areas in the next two articles, in more detail.

The above mentioned capabilities have an engineering and architecture component and cover very briefly the roles of security engineers and architects. A SABSA based approach without the SABSA theory can work. So an architect takes into account risk (maybe with a threat modelling approach) and compliance goals in a High Level Design (HLD), and generates requirements for the Low Level Design (LLD), which will be compiled by a security engineer. The LLD gives a breakdown of security controls to meet the requirements of the HLD, and how to configure the controls.

Security Engineers and Devops Tools

What happens when a devops peep interviews a security peep? Well – they only have their frame of reference to go by. They will of course ask questions about devops tools. How useful is this approach? Not very. Is this is good test of a security engineer? Based on the security requirements for fintechs, the answer is clear.

Security engineers can use devops tools, and they do, and it doesn’t take a 2 week training course to learn Ansible. There is no great mystery in Kubernetes. If you hire a security engineer with the right background (see the previous post in this series) they will adapt easily. The word on the street is that Terraform config isn’t the greatest mystery in the world and as long as you know Linux, and can understand what the purpose of the tool is (how it fits in, what is the expected result), the time taken to get productive is one day or less.

The point is: if i’m a security engineer and i need to, for example, setup a cloud SIEM collector: some fintechs will use one Infrastructure As Code (IaC) tool, others use another one – one will use Chef, another Ansible, and there are other permutations. Is a lack of familiarity with the tool a barrier to progress? No. So why would you test a security engineer’s suitability for a fintech role by asking questions about e.g. stanzas in Ansible config? You need to ask them questions about the six capabilities I mentioned above – i.e. security questions for a security professional.

Security Engineers and Clouds

Again – what was the transition period from on-premise to cloud? Lets take an example – I know how networking works on-premise. How does it work in cloud? There is this thing called a firewall on-premise. In Azure it’s called a Network Security Group. In AWS its called a …drum roll…firewall. In Google Cloud its called a …firewall. From the web-based portal UI for admin, these appear to filter by source and destination addresses and services, just like an actual non-virtual firewall. They can also filter by service account (GCP), or VM tag.

There is another thing called VPN. And another thing called a Virtual Router. On the world of on-premise, a VPN is a …VPN. A virtual router is a…router. There might be a connection here!

Cloud Service Providers (CSP) in general don’t re-write IT from the ground up. They still use TCP/IP. They host virtual machines (VM) instead of real machines, but as VMs have operating systems which security engineers (with the right background) are familiar with, where is the complication here?

The areas that are quite new compared to anything on-premise are areas where the CSP has provided some technology for a security capability such as SIEM, secrets management, or Identity Management. But these are usually sub-standard for the purpose they were designed for – this is deliberate – the CSPs want to work with Commercial Off The Shelf (COTS) vendors such as Splunk and Qualys, who will provide a IaaS or SaaS solution.

There is also the subject of different clouds. I see some organisations being fussy about this, e.g. a security engineer who worked a lot with Azure but not AWS, is not suitable for a fintech that uses AWS. Apparently. Well, given that the transition from on-premise to cloud was relatively painless, how painful is it to transition from Azure to AWS or …? I was on a project last summer where the fintech used Google Cloud Platform. It was my first date with GCP but I had worked with AWS and Azure before. Was it a problem? No. Do i have an IQ of 160? Hell no!

The Wrap-up

Problems we see in fintech infosec hiring represent what is most likely a lack of understanding of how they can best manage risk with a budget that is considerably less than a large MNC for example. But in security we haven’t been particularly helpful for fintechs – the problem is on us.

The security challenge for fintechs is not just about SAST/DAST of their code. The challenge is wider and be represented as six security capabilities that need to be designed with an architecture and engineering view. This sounds expensive, but its a one-off design process that can be covered in a few weeks. The on-going security challenge, whereby capabilities are pushed through into the final security operations stage, can be realised with one or two security engineers.

The lack of understanding of requirements in security leads to some poor hiring practices, the most common of which is to interview a security engineer with a devops guru. The fintech will be rejecting lots of good security engineers with this approach.

In so many ways, the growth of small to medium development houses has exposed the weaknesses in the infosec sector more than they were ever exposed with large organisations. The lack of the sector’s ability to help fintechs exposes a fundamental lack of skilled personnel, more particularly at the strategic/advisory level than others.

Security Macromorphosis

Ian Tibble's Security Blog

Category Archives: Devsecops