Development, Python

Azure Terraform Managed Infrastructure Calculator

I use terraform with Azure for all the obvious benefits that Infrastructure as Code provides. However, sometimes it makes more sense to provision resources via the portal or the az CLI: for speed, for resources terraform doesn't yet support, or just for experimenting.
This got me thinking: how much of my deployed infrastructure is actually managed via terraform?

So I wrote a little proof of concept application using the Azure Python SDK. It assumes you manage your terraform state in an Azure storage account (as terraform recommends).

Firstly, you'll need to set the following environment variables for a service principal which has appropriate permissions to view the resources you intend to query (including Azure Active Directory groups).

export ARM_SUBSCRIPTION_ID="xxxx"
export ARM_CLIENT_ID="xxxxxx"
export ARM_CLIENT_SECRET="xxxxxxxx"
export ARM_TENANT_ID="xxxxxxxx"

The subscription ID refers to the subscription where the storage account containing the terraform state resides. An alternative method would be to generate a Shared Access Signature (SAS) token so that the storage account can be accessed regardless of subscription permissions.
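
For reference, generating an account-level SAS with the azure-storage-blob package looks roughly like this; this is a hedged sketch rather than code from the repo, and the account name, key, permissions and expiry are placeholder assumptions:

from datetime import datetime, timedelta
from azure.storage.blob import generate_account_sas, ResourceTypes, AccountSasPermissions

# Placeholder values - point these at the storage account that holds your terraform state.
sas_token = generate_account_sas(
    account_name="tfstateaccount",
    account_key="xxxx",
    resource_types=ResourceTypes(container=True, object=True),
    permission=AccountSasPermissions(read=True, list=True),
    expiry=datetime.utcnow() + timedelta(hours=1),
)
# The resulting token can be passed as the credential when constructing a BlobServiceClient.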

You will also need to update config.json to include all the subscription IDs you want to scrape, as well as explicitly set the storage account name. This is just proof-of-concept code and could easily be improved to be more autonomous.

Once you've set up the variables and configuration file, you should be ready to run. Ensure the Python packages are installed:

pip3 install azure-storage-blob 'azure-mgmt-storage>=3.0.0' msrestazure azure-identity

And then simply run the iacPercentage.py application.

python3 iacPercentage.py
...
INFO - 180 resources in azure
INFO - 118 resources not in terraform
INFO - 34.44% managed by terraform

You can also run with a debug flag to increase verbosity and see exactly which resources are being compared.

python3 iacPercentage.py -l DEBUG
...
DEBUG - /subscriptions/xxxxxxxxxxxxxxxxxxxxxxx/resourceGroups/xxxxxxxx/providers/Microsoft.Network/virtualNetworks/xxxxxxxx is present in terraform state
DEBUG - xxxxxxxx-xxxx-xxxxxxxxx-xxxxxxxxxxxx is present in terraform state
DEBUG - xxxxxxxx-xxxx-xxxxxxxxx-xxxxxxxxxxxx is present in terraform state
DEBUG - xxxxxxxx-xxxx-xxxxxxxxx-xxxxxxxxxxxx is present in terraform state
DEBUG - xxxxxxxx-xxxx-xxxxxxxxx-xxxxxxxxxxxx is present in terraform state
DEBUG - xxxxxxxx-xxxx-xxxxxxxxx-xxxxxxxxxxxx is present in terraform state
DEBUG - xxxxxxxx-xxxx-xxxxxxxxx-xxxxxxxxxxxx is present in terraform state
INFO - 180 resources in azure
INFO - 118 resources not in terraform
INFO - 34.44% managed by terraform

How Does it Work?
The application first uses the service principal credentials to generate an OAuth token.
Once the token is successfully generated, an HTTP request is fired at the Azure management API to retrieve all resources under the subscriptions specified in the configuration file.

Each resource's `id` value is added to an array. Once the subscriptions have been scraped, the same principle is applied to the Azure Graph application endpoint to retrieve all service principals, and then to the groups endpoint to retrieve all Azure Active Directory group IDs. So at the end of this process, we have the IDs of all resources (within subscriptions), all app registrations and all AAD groups stored in an array.
Note that not all ID values have the same format: some are UUIDs, while others are full resource URIs with a UUID embedded within them.
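
As a rough sketch of that first step (not the exact code from the repo; the api-version, the use of the requests library and the variable names are my own assumptions), listing the resource IDs for a single subscription looks something like this:

import os
import requests
from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id=os.environ["ARM_TENANT_ID"],
    client_id=os.environ["ARM_CLIENT_ID"],
    client_secret=os.environ["ARM_CLIENT_SECRET"],
)
token = credential.get_token("https://management.azure.com/.default").token

subscription_id = os.environ["ARM_SUBSCRIPTION_ID"]
url = f"https://management.azure.com/subscriptions/{subscription_id}/resources"
resp = requests.get(
    url,
    params={"api-version": "2021-04-01"},
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()

# Collect the full resource IDs for the later comparison against the terraform state.
# The same pattern is then repeated against the Graph endpoints for service principals and AAD groups.
azure_ids = [resource["id"] for resource in resp.json()["value"]]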

Once the Azure resources are retrieved, the application then needs to figure out which resources terraform knows about. The application uses the Azure SDK to connect to the specified storage account (where the shared terraform state lives) and retrieves the storage account's first key (key1). This key is then used in a subsequent request to query the storage account.
The terraform state blob is then searched: whenever an entry has an `id` value (which they all should), that value is stored in an array. The terraform state is iterated recursively to capture every ID.
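
A minimal sketch of that recursive walk, assuming the state blob has already been downloaded and parsed as JSON (the function and variable names here are mine, not the repo's):

import json

def collect_ids(node, ids):
    # Recursively collect every "id" value found anywhere in the terraform state.
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "id" and isinstance(value, str):
                ids.append(value)
            else:
                collect_ids(value, ids)
    elif isinstance(node, list):
        for item in node:
            collect_ids(item, ids)

terraform_ids = []
collect_ids(json.loads(state_blob_text), terraform_ids)  # state_blob_text: the downloaded tfstate JSON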

Now, the terraform state does not delete values if they are no longer used. For example, if I created a terraform resource called "test" but then deleted the resource via the portal, the state file would still have a record of that resource; it would just be outdated.

Finally, the application iterates through all the Azure resources, takes each ID value and checks whether it is in the terraform list. If the entry is not in the terraform list, then terraform doesn't know anything about it, so it was not provisioned via terraform. The size of the unmanaged list is compared against the full Azure resource list and a percentage is derived.
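
Roughly, that last step boils down to a set difference and a percentage (again illustrative rather than the repo's exact code; the real comparison also has to cope with the UUID-versus-full-URI mismatch mentioned earlier):

# azure_ids and terraform_ids come from the previous two steps.
unmanaged = [resource_id for resource_id in azure_ids if resource_id not in terraform_ids]

managed_pct = 100 * (len(azure_ids) - len(unmanaged)) / len(azure_ids)
print(f"{len(azure_ids)} resources in azure")
print(f"{len(unmanaged)} resources not in terraform")
print(f"{managed_pct:.2f}% managed by terraform")

Plugging in the run above (180 resources, 118 unmanaged) gives the 34.44% figure shown in the log output.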

Working in the cloud provides new opportunities for information assurance, and this is a cool little idea to help provide better insight into your infrastructure. What you have deployed is not always the same as what you think you have deployed, and this application helps to highlight those gaps.

Full code is available here:
https://gitlab.com/ashleykingscote/azure-iac-calc

Miscellaneous, Security

What is Personal Data? (3/3)

After consolidating just those 3 data sources, I have a HUGE amount of information on the residents of Reading. Obviously the results are only as good as the data. The Companies House data is making some big assumptions, but it's useful as another piece of information to help piece together the picture.

Install the following prereqs:

sudo apt-get install -y curl
sudo add-apt-repository ppa:ubuntugis/ppa
sudo apt-get update
sudo apt-get install -y gdal-bin

Run this script to install the OpenStreetMap land polygons – https://raw.githubusercontent.com/go-spatial/tegola-osm/master/osm_land.sh

Download postgis_helpers.sql and install

wget https://raw.githubusercontent.com/go-spatial/tegola-osm/master/postgis_helpers.sql
psql -U ashley -d gis -a -f postgis_helpers.sql

Download postgis_index.sql and install

wget https://raw.githubusercontent.com/go-spatial/tegola-osm/master/postgis_index.sql
psql -U ashley -d gis -a -f postgis_index.sql

Download https://raw.githubusercontent.com/go-spatial/tegola-osm/master/natural_earth.sh and run it
It will download the natural_earth vectors and put them in the `natural_earth` database, so that stuff actually renders.

Make sure you have the GIS extensions enabled on the database you're using for the public data.

sudo -u postgres psql -d public_data -c 'CREATE EXTENSION postgis; CREATE EXTENSION hstore;'

Here is an example of the table schema I used:

sudo -u postgres psql public_data

CREATE TABLE example ( 
  id  INTEGER,
  name VARCHAR,
  latitude FLOAT,
  longitude FLOAT,
  geom geometry(Point, 4326),
  tags HSTORE,
  CONSTRAINT example_pky PRIMARY KEY (id)
);

Make sure that the permissions are set. A good test to check that it is working is to view the data (not just the columns) in pgAdmin.

GRANT CONNECT ON DATABASE public_data TO ashley;
GRANT ALL PRIVILEGES ON TABLE example TO ashley;

Now insert a record with some lat and long data

INSERT INTO example(id, name, latitude, longitude, tags) VALUES (1, 'reading station', 51.455893, -0.971474, '"name" => "reading station"');

Now populate the `geom` field from the latitude and longitude fields. Here I'm using the Point geometry type.

UPDATE example SET geom = ST_MakeValid(ST_SetSRID(ST_MakePoint(longitude, latitude), 4326));

This query is magic and will convert all lat/long fields into the geometry type required by tile servers. The 4326 is the SRID (WGS 84, i.e. standard lat/long coordinates), which is what tegola expects.

Here is an example before/after:

public_data=# INSERT INTO example(id, name, latitude, longitude) VALUES (1, 'reading station', 51.455893, -0.971474);
INSERT 0 1
public_data=# SELECT * FROM example;
 id |      name       | latitude  | longitude | geom 
----+-----------------+-----------+-----------+------
  1 | reading station | 51.455893 | -0.971474 | 
(1 row)

public_data=# UPDATE example SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);
UPDATE 1
public_data=# SELECT * FROM example;
 id |      name       | latitude  | longitude |                        geom                        
----+-----------------+-----------+-----------+----------------------------------------------------
  1 | reading station | 51.455893 | -0.971474 | 0101000020E6100000FA415DA45016EFBFD8BCAAB35ABA4940
(1 row)

Now that I have the geometry value populated, I'll see if I can get it rendered on the tegola server.

I installed pgAdmin and connected it to the PostgreSQL database as it's much easier to manage the data with a GUI.

If you run into a problem, you can delete everything from a table in Postgres using TRUNCATE:

truncate table example;

So after I uploaded my data from my SQLite database into Postgres (companieshouseofficers, landregistry & openregister tables), I just needed to update the config.toml file so that tegola can render the layers separately.

Now obviously, I could do some SQL joins and cross-reference on names/addresses, but I don't really need to do that. I'm not actually interested in finding specific information on a person; I just want to demonstrate the vast amount of information available.

Again, I feel I need to reiterate something: this is open data! I've sourced this with minimal effort – in the scope of hours. It doesn't bear thinking about what a truly motivated individual could source.

I'm not interested in plotting every field available in each data source, just a select few.
For the open register, I'm extracting just the names (and the geometry, obviously).
For the Companies House officers, I'm extracting the name, company name and geometry.
For the pricePaid (land registry) data, I'm extracting the price paid, transaction date, address and geometry.

For completeness, here is my config.toml file:

[webserver]
port =  ":8080"

[[providers]]
name = "osm"
type = "postgis"
host = "localhost"
port = 5432
database = "gis"
user = "ashley"
password = "ashley"

	[[providers.layers]]
	name = "land_8-20"
	geometry_fieldname = "geometry"
	id_fieldname = "ogc_fid"
	sql = "SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid FROM land_polygons WHERE wkb_geometry && !BBOX!"

	# Water
	[[providers.layers]]
	name = "water_areas"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, type, area FROM osm_water_areas WHERE type IN ('water', 'pond', 'basin', 'canal', 'mill_pond', 'riverbank', 'dock') AND geometry && !BBOX!"

	[[providers.layers]]
	name = "water_areas_gen0"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, type, area FROM osm_water_areas_gen0 WHERE type IN ('water', 'pond', 'basin', 'canal', 'mill_pond', 'riverbank') AND area > 1000000000 AND geometry && !BBOX!"

	[[providers.layers]]
	name = "water_areas_gen0_6"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, type, area FROM osm_water_areas_gen0 WHERE type IN ('water', 'pond', 'basin', 'canal', 'mill_pond', 'riverbank') AND area > 100000000 AND geometry && !BBOX!"

	[[providers.layers]]
	name = "water_areas_gen1"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, type, area FROM osm_water_areas_gen1 WHERE type IN ('water', 'pond', 'basin', 'canal', 'mill_pond', 'riverbank') AND area > 1000 AND geometry && !BBOX!"

	[[providers.layers]]
	name = "water_lines"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, type FROM osm_water_lines WHERE type IN ('river', 'canal', 'stream', 'ditch', 'drain', 'dam') AND geometry && !BBOX!"

	[[providers.layers]]
	name = "water_lines_gen0"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, type FROM osm_water_lines_gen0 WHERE type IN ('river', 'canal') AND geometry && !BBOX!"

	[[providers.layers]]
	name = "water_lines_gen1"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, type FROM osm_water_lines_gen1 WHERE type IN ('river', 'canal', 'stream', 'ditch', 'drain', 'dam') AND geometry && !BBOX!"

	# Land Use
	[[providers.layers]]
	name = "landuse_areas"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, class, type, area FROM osm_landuse_areas WHERE geometry && !BBOX!"

	[[providers.layers]]
	name = "landuse_areas_gen0"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, class, type, area FROM osm_landuse_areas_gen0 WHERE type IN ('forest','wood','nature reserve', 'nature_reserve', 'military') AND area > 1000000000 AND geometry && !BBOX!"

	[[providers.layers]]
	name = "landuse_areas_gen0_6"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, class, type, area FROM osm_landuse_areas_gen0 WHERE type IN ('forest','wood','nature reserve', 'nature_reserve', 'military') AND area > 100000000 AND geometry && !BBOX!"

	[[providers.layers]]
	name = "landuse_areas_gen1"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, class, type, area FROM osm_landuse_areas_gen1 WHERE geometry && !BBOX!"

	# Transport

	[[providers.layers]]
	name = "transport_lines_gen0"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, type, tunnel, bridge, ref FROM osm_transport_lines_gen0 WHERE type IN ('motorway','trunk','motorway_link','trunk_link','primary') AND tunnel = 0 AND bridge = 0  AND geometry && !BBOX!"

	[[providers.layers]]
	name = "transport_lines_gen1"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, ref, class, type FROM osm_transport_lines_gen1 WHERE type IN ('motorway', 'trunk', 'primary', 'primary_link', 'secondary', 'motorway_link', 'trunk_link') AND geometry && !BBOX!"

	[[providers.layers]]
	name = "transport_lines_11-12"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, ref, class, type, tunnel, bridge, access, service FROM osm_transport_lines WHERE type IN ('motorway', 'motorway_link', 'trunk', 'trunk_link', 'primary', 'primary_link', 'secondary', 'secondary_link', 'tertiary', 'tertiary_link', 'rail', 'taxiway', 'runway', 'apron') AND geometry && !BBOX!"

	[[providers.layers]]
	name = "transport_lines_13"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, ref, class, type, tunnel, bridge, access, service FROM osm_transport_lines WHERE type IN ('motorway', 'motorway_link', 'trunk', 'trunk_link', 'primary', 'primary_link', 'secondary', 'secondary_link', 'tertiary', 'tertiary_link', 'rail', 'residential', 'taxiway', 'runway', 'apron') AND geometry && !BBOX!"

	[[providers.layers]]
	name = "transport_lines_14-20"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, ref, class, type, tunnel, bridge, access, service FROM osm_transport_lines WHERE geometry && !BBOX!"

	# Buildings
	[[providers.layers]]
	name = "buildings"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, nullif(as_numeric(height),-1) AS height, type FROM osm_buildings WHERE geometry && !BBOX!"

#	Natural Earth
[[providers]]
name = "ne"
type = "postgis"
host = "localhost"
port = 5432
database = "natural_earth"
user = "<username>"
password = "<password>"

	[[providers.layers]]
	name = "ne_110m_land"
	geometry_fieldname = "geometry"
	id_fieldname = "ogc_fid"
	sql = "SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid, featurecla, min_zoom FROM ne_110m_land WHERE wkb_geometry && !BBOX!"

	[[providers.layers]]
	name = "ne_50m_land"
	geometry_fieldname = "geometry"
	id_fieldname = "ogc_fid"
	sql = "SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid, featurecla, min_zoom FROM ne_50m_land WHERE wkb_geometry && !BBOX!"

	[[providers.layers]]
	name = "ne_10m_land"
	geometry_fieldname = "geometry"
	id_fieldname = "ogc_fid"
	sql = "SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid, featurecla, min_zoom FROM ne_10m_land WHERE wkb_geometry && !BBOX!"

	[[providers.layers]]
	name = "ne_10m_roads_3"
	geometry_fieldname = "geometry"
	id_fieldname = "ogc_fid"
	sql = "SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid, name, min_zoom, min_label, type, label FROM ne_10m_roads WHERE min_zoom < 5 AND type <> 'Ferry Route' AND wkb_geometry && !BBOX!"

	[[providers.layers]]
	name = "ne_10m_roads_5"
	geometry_fieldname = "geometry"
	id_fieldname = "ogc_fid"
	sql = "SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid, name, min_zoom, min_label, type, label FROM ne_10m_roads WHERE min_zoom <= 7  AND type <> 'Ferry Route' AND wkb_geometry && !BBOX!"

# custom provider
[[providers]]
name = "public_data"
type = "postgis"
host = "localhost"
port = 5432
database = "public_data"
user = "ashley"
password = "ashley"

	[[providers.layers]]
	name = "openregister"
	geometry_type = "Point"
	id_fieldname = "id"
	srid = 4326
	sql = "SELECT ST_AsBinary(geom) AS geom, name, id FROM openregister WHERE geom && !BBOX!"

	[[providers.layers]]
	name = "companiesHouseOfficers"
	geometry_type = "Point"
	id_fieldname = "id"
	srid = 4326
	sql = "SELECT ST_AsBinary(geom) AS geom, name, company_name, id FROM companiesHouseOfficers WHERE geom && !BBOX!"

	[[providers.layers]]
	name = "pricePaid"
	geometry_type = "Point"
	id_fieldname = "id"
	srid = 4326
	sql = "SELECT ST_AsBinary(geom) AS geom, value, transaction_date, address1, id FROM landRegistry WHERE geom && !BBOX!"

[[maps]]
name = "osm"
attribution = "OpenStreetMap" 
center = [-0.98403, 51.44507, 13.0] #Reading

	# Land Polygons
	[[maps.layers]]
	name = "land"
	provider_layer = "ne.ne_110m_land"
	min_zoom = 0
	max_zoom = 2

	[[maps.layers]]
	name = "land"
	provider_layer = "ne.ne_50m_land"
	min_zoom = 3
	max_zoom = 4

	[[maps.layers]]
	name = "land"
	provider_layer = "ne.ne_10m_land"
	min_zoom = 5
	max_zoom = 7

	[[maps.layers]]
	name = "land"
	provider_layer = "osm.land_8-20"
	dont_simplify = true
	min_zoom = 8
	max_zoom = 20

	# Land Use
	[[maps.layers]]
	name = "landuse_areas"
	provider_layer = "osm.landuse_areas_gen0"
	min_zoom = 3
	max_zoom = 5

	[[maps.layers]]
	name = "landuse_areas"
	provider_layer = "osm.landuse_areas_gen0_6"
	min_zoom = 6
	max_zoom = 9

	[[maps.layers]]
	name = "landuse_areas"
	provider_layer = "osm.landuse_areas_gen1"
	min_zoom = 10
	max_zoom = 12

	[[maps.layers]]
	name = "landuse_areas"
	provider_layer = "osm.landuse_areas"
	min_zoom = 13
	max_zoom = 20

	# Water Areas
	[[maps.layers]]
	name = "water_areas"
	provider_layer = "osm.water_areas_gen0"
	min_zoom = 3
	max_zoom = 5

	[[maps.layers]]
	name = "water_areas"
	provider_layer = "osm.water_areas_gen0_6"
	min_zoom = 6
	max_zoom = 9

	[[maps.layers]]
	name = "water_areas"
	provider_layer = "osm.water_areas_gen1"
	min_zoom = 10
	max_zoom = 12

	[[maps.layers]]
	name = "water_areas"
	provider_layer = "osm.water_areas"
	min_zoom = 13
	max_zoom = 20

	# Water Lines
	[[maps.layers]]
	name = "water_lines"
	provider_layer = "osm.water_lines_gen0"
	min_zoom = 8
	max_zoom = 12

	[[maps.layers]]
	name = "water_lines"
	provider_layer = "osm.water_lines_gen1"
	min_zoom = 13
	max_zoom = 14

	[[maps.layers]]
	name = "water_lines"
	provider_layer = "osm.water_lines"
	min_zoom = 15
	max_zoom = 20

	# Transport Lines (Roads, Rail, Aviation)
	[[maps.layers]]
	name = "transport_lines"
	provider_layer = "ne.ne_10m_roads_3"
	min_zoom = 3
	max_zoom = 4

	[[maps.layers]]
	name = "transport_lines"
	provider_layer = "ne.ne_10m_roads_5"
	min_zoom = 5
	max_zoom = 6

	[[maps.layers]]
	name = "transport_lines"
	provider_layer = "osm.transport_lines_gen0"
	min_zoom = 7
	max_zoom = 8

	[[maps.layers]]
	name = "transport_lines"
	provider_layer = "osm.transport_lines_gen1"
	min_zoom = 9
	max_zoom = 10

	[[maps.layers]]
	name = "transport_lines"
	provider_layer = "osm.transport_lines_11-12"
	min_zoom = 11
	max_zoom = 12

	[[maps.layers]]
	name = "transport_lines"
	provider_layer = "osm.transport_lines_13"
	min_zoom = 13
	max_zoom = 13

	[[maps.layers]]
	name = "transport_lines"
	provider_layer = "osm.transport_lines_14-20"
	min_zoom = 14
	max_zoom = 20

	# Buildings
	[[maps.layers]]
	name = "buildings"
	provider_layer = "osm.buildings"
	min_zoom = 14
	max_zoom = 20

	[[maps.layers]]
	name = "openregister"
	provider_layer = "public_data.openregister"
	min_zoom = 14
	max_zoom = 20

	[[maps.layers]]
	name = "companiesHouseOfficers"
	provider_layer = "public_data.companiesHouseOfficers"
	min_zoom = 14
	max_zoom = 20

	[[maps.layers]]
	name = "pricePaid"
	provider_layer = "public_data.pricePaid"
	min_zoom = 14
	max_zoom = 20

Notice that I've set the zoom levels to something sensible, as I don't want all the data points to render immediately. When I zoom in past level 14 on a specific area, that's when my data comes to life.

So now, let's fire up the server.

./tegola serve --config=config.toml

Open up a browser on localhost:8080 and I can see that the Reading roads have rendered!

The second I zoom in, I can see a huge amount of information.

In the interest of complete visibility, I goofed up. For some reason, I tried sorting the land registry data in Excel and ended up completely misaligning the latitude column. Rather than re-running the entire dataset (the HUGE dataset) through the Google API again (I'd have to sign up for a free trial), I ended up just using a smaller dataset that I had saved from a dry run. So for the land registry data, I am only using 17,000 records (as opposed to the 158,225 initially retrieved). That means I am plotting barely 10% of the land registry data I could have – so there should be a LOT more yellow dots.

By using the Inspect Features utility at the bottom, I can hover over a data point and see the information I've extracted in the query. Here I have deselected companiesHouseOfficers and pricePaid and just left the open register data on. I've zoomed in on a random area near the town centre.
When you look at data in a database or spreadsheet, I find that it doesn't really mean much to you. With 1,000 records in a database, it is difficult to realise just how powerful that information is. By plotting it on a map, I feel you get a much better sense of what that data means and how powerful it is.
I've purposely not shown these people's specific addresses in the screenshot, but the point is I could have. I have that data, as it is free and open. Again, the whole point of this series is to try to understand whether or not this IS private information. Evidently, to these people it's not.

I've put together a short video capture to show you around the dataset I've gathered.
https://youtu.be/-lDb1QhGC1g

What's really cool is that the conversion from address to latitude/longitude (via the Google API) to geometry (via the SQL query) has worked brilliantly! In the video, you can see that as I scroll the cursor up and down a street, the house numbers increase and decrease as expected. It has managed to plot the data quite accurately with nothing but an address!

Upon reflection, I wish I had re-captured all of that land registry data as it would have shown many more data points; however, I think I've proved my point.
If I had the time, I'd love to pursue this project further, include more datasets and even perform some analysis on the data.

I think that your address cannot be personal information. Whether you know it or not, the association between your name and address is stored on hundreds of servers, and that information is not hashed (typically only bank information and passwords are). I don't believe that every company holding your information has your best interests at heart, or even appropriate security controls. If you expect to keep that information private, it's never going to happen. I feel the same way about email addresses. The issue is that you can derive a lot of secondary information about an individual from their address and email address.

This could go down a huge rabbit hole, but I think I'll leave it there.

Development, Golang, Miscellaneous

What is Personal Data? (2/3)

Continuing on from my first post, this post focuses on sourcing and massaging my data into a usable format ready for cross-referencing and some analysis.
As detailed in my first post, I received 38,742 records of names and addresses across 52 different files of people on the open register in Reading.

The columns in the data set are: "Elector Number Prefix, Elector Number, Elector Number Suffix, Elector Markers, Elector DOB, Elector Name, PostCode, Address1, Address2, Address3, Address4, Address5, Address6".
The DOB field is empty for nearly all the records.

I'm increasingly using Golang, so I thought it would be a good idea to process this data and consolidate it into a single table in a SQLite database.
I'll dump some of the code here, but it was just strung together quickly and may not work entirely as-is, since I've cut and pasted it from multiple files.

package main

import (
	"database/sql"
	"fmt"
	"io/ioutil"
	"log"
	"strconv"
	"strings"

	"github.com/360EntSecGroup-Skylar/excelize"
	_ "github.com/mattn/go-sqlite3" // registers the "sqlite3" driver used by sql.Open
)


type ElectoralRow struct {
	Area                string
	ElectorNumberPrefix string
	ElectorNumber       int
	ElectorNumberSuffix int
	ElectorMaker        string
	ElectorDOB          string
	ElectorName         string
	PostCode            string
	Address1            string
	Address2            string
}

func main() {
	database := createTables()
	addElectoral(database)
}


func createTables() *sql.DB {
	db, _ := sql.Open("sqlite3", "./database.sqlite")
	statement, _ := db.Prepare(`CREATE TABLE IF NOT EXISTS openReg (id INTEGER PRIMARY KEY,
		 area TEXT, electorNumberPrefix TEXT, electorNumber INTEGER, electorNumberSuffix INTEGER,
		 electorMaker TEXT, electorDOB TEXT, electorName TEXT, postCode TEXT, address1 TEXT, address2 TEXT)`)
	statement.Exec()
	return db
}

func addElectoral(database *sql.DB) {
	var openRegDir string = "C:\\Users\\ashkingie\\Desktop\\goLearn\\src\\openReg\\"
	files, err := ioutil.ReadDir(openRegDir)
	if err != nil {
		log.Fatal(err)
	}
	for _, file := range files {
		var fullPath string = fmt.Sprintf("%s%s", openRegDir, file.Name())
		fmt.Println("Opening ", fullPath)

		if fullPath[len(fullPath)-4:] != "xlsx" {
			fmt.Println("BREAKING - ", fullPath)
			break
		}

		f, err := excelize.OpenFile(fullPath)
		if err != nil {
			fmt.Println("ERROR opening file", fullPath, err)
			return
		}

		var sheetName string = "Sheet1"
		rows, err := f.GetRows(sheetName)
		
		if err != nil {
			fmt.Println("error getting sheet", sheetName, fullPath)
		}
		
		for _, excelRow := range rows {
			data := electoralRowToStruct(file.Name(), excelRow)
			insertOpenReg(database, data)
		}
		
	}
}


func electoralRowToStruct(fileName string, excelRow []string) ElectoralRow {
	electorNumber, err := strconv.Atoi(excelRow[1])
	if err != nil {
		electorNumber = 0
	}

	electorNumberSuffix, err := strconv.Atoi(excelRow[2])
	if err != nil {
		electorNumberSuffix = 0
	}
	fileName = processElecFilename(fileName)
	//make address uppercase for consistency

	rowData := ElectoralRow{
		Area:                fileName,
		ElectorNumberPrefix: excelRow[0],
		ElectorNumber:       electorNumber,
		ElectorNumberSuffix: electorNumberSuffix,
		ElectorMaker:        excelRow[3],
		ElectorDOB:          excelRow[4],
		ElectorName:         excelRow[5],
		PostCode:            strings.ToUpper(excelRow[6]),
		Address1:            strings.ToUpper(excelRow[7]),
		Address2:            strings.ToUpper(excelRow[8]),
	}
	return rowData
}

func processElecFilename(filename string) (string) {
	if filename[len(filename)-5:] == ".xlsx" {
		filename = filename[:len(filename)-5]
	}
	split := strings.Split(filename, "-")
	filename = split[1]
	return filename
}

func insertOpenReg(database *sql.DB, row ElectoralRow) {
	statement, _ := database.Prepare("INSERT INTO openReg (area, electorName, address1, postCode) VALUES (?, ?, ?, ?)")
	statement.Exec(row.Area, row.ElectorName, row.Address1, row.PostCode)
}

Once the data is in a single table, it's much easier to manage for processing as it's in a consistent format.
The next step is the most interesting: how on earth do I plot an address onto a map?

PostGIS uses the concept of a geometry, so I need to convert an address into a geometry. The way to do that is to first convert the address into a latitude and longitude.
I decided to do this using the Google Geocoding API: you pass it an address and it returns the latitude and longitude.

Google offer a free trial of their API: you get $300 of credit for 12 months and no auto-charge after the trial ends.

Looking at their pricing sheet, it only costs a couple of dollars for 40K requests, so even if I exceed the free trial I shouldn't get a huge bill.
Sign up for an API key; it must be of type Server key.
I won't post the full code here, but in a nutshell: I signed up and got an API key. Using Golang, I iterate through every record and make sure that I have house number, road and postcode information. If I have that information, I use the Go HTTP client to send a request to the API, process the response and store the values in the "latitude" and "longitude" columns for that record in the SQLite database.

// Note: the original snippet was cut and pasted from a larger file. The imports and the
// GeoCodeResponse struct below are assumptions added here to make it self-contained; the
// struct only maps the parts of the Geocoding API response that are needed (status and
// lat/lng), and the log.Error calls assume logrus is the logger in use.
import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"net/http"
	"net/url"

	log "github.com/sirupsen/logrus"
)

const apiKey = "api key as a string here"
const urlBase = "https://maps.googleapis.com/maps/api/geocode/json?"

// GeoCodeResponse holds the fields we care about from the Geocoding API response.
type GeoCodeResponse struct {
	Status  string `json:"status"`
	Results []struct {
		FormattedAddress string `json:"formatted_address"`
		Geometry         struct {
			Location struct {
				Lat float64 `json:"lat"`
				Lng float64 `json:"lng"`
			} `json:"location"`
		} `json:"geometry"`
	} `json:"results"`
}

// GetLatLong builds a Geocoding API URL from the address parts and returns the parsed response.
// If you don't have a house number, pass "0" and just the road and postcode are used.
// Google throttles requests, so add a delay after each call.
func GetLatLong(houseNumber string, road string, postcode string) GeoCodeResponse {
	// e.g. https://maps.googleapis.com/maps/api/geocode/json?address=130%20St%20Saviours%20Road%20Reading%20UK&key=<key_here>
	base, _ := url.Parse(urlBase)
	params := url.Values{}

	if houseNumber == "0" {
		params.Add("address", fmt.Sprintf("%s %s", road, postcode))
	} else {
		params.Add("address", fmt.Sprintf("%s %s %s", houseNumber, road, postcode))
	}
	params.Add("key", apiKey)

	base.RawQuery = params.Encode()

	gourl := base.String()
	fmt.Println("URL IS", gourl)

	var response GeoCodeResponse
	resp, err := http.Get(gourl)
	if err != nil {
		log.Error("error getting data from ", gourl, err)
		return response
	}
	defer resp.Body.Close()

	if resp.StatusCode != 200 {
		log.Error("didn't receive 200 from url ", gourl)
		return response
	}

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Error("error reading body of data ", err)
	}
	if err := json.Unmarshal(body, &response); err != nil {
		log.Error("error unmarshalling response to struct", err)
	}

	return response
}

There is API throttling in place, so I put a small sleep in after each iteration, although it's not in this code snippet as it was cut and pasted.

The results are amazing: the geocoding API returns the coordinates with fantastic accuracy. I tested it on my own property and it was spot on.
So now I have 38K names and addresses in Reading, with lat/long coordinates. I'm starting to feel creepy – even though this is OPEN data.

My next task was to repeat the process for some extra data sources. There are tonnes of sources out there and I could sign up for endless free trials, but this really is just a proof of concept. The next data source I'm going to mine is Companies House.

According to the Office for National Statistics, there are more than 5 million people in the UK who are self-employed: anyone who is a contractor or otherwise works for themselves.
ONS – Coronavirus and self-employment in the UK
This means that anyone who is self-employed or a contractor will likely have their information on Companies House.

So what I did was download the Companies House data from their website.

The Companies House data came as Excel files. It was HUGE: I got 6 .xlsx files, each 400MB.
The information I received was the company's registered address, whether or not the company was active, whether its accounts were up to date, the category of the company and, the most crucial bit of information, the URL to its record online.

Similar to the open register data, I wrote a Go program that opened each file, iterated through the records and added them to a SQLite table when the data looked usable (was in Reading and had the relevant fields populated).
In the end, I found about 22K usable records for Reading.

Now it turns out that Companies House has an API and you can sign up for an API key.

If you go to a company's record (once authenticated), you can add the "/officers" suffix and it will list the name and address of every registered officer for that company. So my 22K records were about to explode in size: each company would surely have at least 2 officers registered, and hopefully they are local to the area.

I just registered with Companies House, then went to "your applications", which lists your API key.

curl -uYOUR_APIKEY_FOLLOWED_BY_A_COLON: https://api.companieshouse.gov.uk/company/{company_number}
curl -u xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: https://api.companieshouse.gov.uk/company/12279804

Here is an example response for company 12279804:

{
   "links":{
      "filing_history":"/company/12279804/filing-history",
      "persons_with_significant_control":"/company/12279804/persons-with-significant-control",
      "self":"/company/12279804",
      "officers":"/company/12279804/officers"
   },
   "company_name":"MY LANE CATERING LTD",
   "has_insolvency_history":false,
   "confirmation_statement":{
      "next_made_up_to":"2020-10-23",
      "next_due":"2020-11-06",
      "overdue":false
   },
   "sic_codes":[
      "56103",
      "56210"
   ],
   "type":"ltd",
   "registered_office_address":{
      "locality":"Reading",
      "country":"United Kingdom",
      "address_line_1":"375 Oxford Road",
      "postal_code":"RG30 1HA"
   },
   "etag":"9ce8a74fbcfd6acd13fd6634d3ab18fc09fdd330",
   "jurisdiction":"england-wales",
   "accounts":{
      "next_made_up_to":"2020-10-31",
      "next_due":"2021-07-24",
      "next_accounts":{
         "due_on":"2021-07-24",
         "period_end_on":"2020-10-31",
         "overdue":false,
         "period_start_on":"2019-10-24"
      },
      "last_accounts":{
         "type":"null"
      },
      "accounting_reference_date":{
         "day":"31",
         "month":"10"
      },
      "overdue":false
   },
   "company_status":"active",
   "undeliverable_registered_office_address":false,
   "registered_office_is_in_dispute":false,
   "company_number":"12279804",
   "has_charges":false,
   "date_of_creation":"2019-10-24",
   "can_file":true
}

Appending the /officers shows me much more interesting information.

curl -u xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: https://api.companieshouse.gov.uk/company/12279804/officers
...
{
  "inactive_count": 0,
  "items_per_page": 35,
  "start_index": 0,
  "kind": "officer-list",
  "items": [
    {
      "links": {
        "officer": {
          "appointments": "/officers/BIIB3OKYK2_8edq-Ue_V9k2cxtU/appointments"
        }
      },
      "address": {
        "country": "United Kingdom",
        "premises": "375",
        "locality": "Reading",
        "address_line_1": "Oxford Road",
        "postal_code": "RG30 1HA"
      },
      "name": "KING, Louis",
      "appointed_on": "2019-10-24",
      "officer_role": "secretary"
    },
    {
      "country_of_residence": "United Kingdom",
      "links": {
        "officer": {
          "appointments": "/officers/5k9v8g6jEtW9JcJZARxfaF67hhI/appointments"
        }
      },
      "date_of_birth": {
        "year": 1983,
        "month": 10
      },
      "nationality": "British",
      "officer_role": "director",
      "name": "KING, Louis",
      "appointed_on": "2019-10-24",
      "address": {
        "postal_code": "RG30 1HA",
        "country": "United Kingdom",
        "locality": "Reading",
        "premises": "375",
        "address_line_1": "Oxford Road"
      },
      "occupation": "Director"
    }
  ],
  "resigned_count": 0,
  "active_count": 2,
  "links": {
    "self": "/company/12279804/officers"
  },
  "total_results": 2,
  "etag": "88d52f539f66c5198500f3af3729a02a199f331b"
}

Now it may seem a bit tight posting this information here, but remember, this is the whole point of the post: it's PUBLIC and OPEN information.
Louis King was born in October 1983 and has a business at (and potentially lives at) 375 Oxford Road.

So what I did was amend the Go program to visit each record's URL and grab the officer data for each company, then add that to my SQLite database. Companies House also throttle their API, so I had to put a delay in and run the program overnight.
For the Companies House data alone, I got 112,086 records. This covers all the companies registered in Reading and their addresses. It also lists the officers' names, dates of birth and addresses – although some of these are not in Reading.

Amending my code as I did before, I passed it through the geocoding API. Unfortunately I had SO many records that I quickly used up my $300 free trial.

So eventually my program returned access denied when trying to geocode the data. I decided that I would just leave it there, as this is about proving a concept rather than doing it properly. The geocoding took ages and unfortunately I made a huge mistake: I was writing the data to a CSV rather than to the database directly, purely out of laziness so that I could reuse some code. I opened up the CSV, sorted it and somehow messed up the ordering!! Fortunately I had a smaller subset of the data, about 20K records, from when I had previously tested the API. I invested a lot of time trying to undo my mistake but unfortunately the data was unusable. I considered signing up for another free trial key and re-running but decided that it wasn't worth it. 20K records will do.

My final data set for this proof of concept (for now) is going to be the land registry.

What I did was go onto the Land Registry website and download all the data for Reading.

Interestingly, you can pay and get information on a specific property – around £3 per house.

If I had the budget, I would scrape this site and get all the property deeds in Reading. This would definitely list sensitive information such as who owns the house, when they bought it and their land boundaries. Again, this is all public and open data; you are just paying an admin fee to access it.

Again, as with the Open Register and Companies House data, I put it into a SQLite table and processed it through the geocoding API. The Land Registry gives me an address, the price paid and the date of the transaction. It's not the most interesting information, and many people know about this data source, but my thinking is that it's an easy and accessible source. It's just one of many sources that, when used in combination, may reveal some interesting insights into how much data is PUBLIC.

By the end of this little mining exercise, I have 112,086 records from Companies House, 38,741 records from the open register and 158,225 records from the Land Registry.
I briefly tried digging into the Twitter API, but it has changed a lot since I last used it (see my making millions post). I managed to get it working, but the legacy data is behind a paywall and I just don't have the budget. If I did, I could get the geodata from a tweet and plot that on my map as well.

So I've used three OPEN and FREE (apart from small admin fees) datasets. If I had the budget or wasn't so moral, I reckon I could definitely mine some serious data. I imagine there are plenty of companies and state-sponsored operations that do just this.

That's enough for this post; in the next post I'll show just what this looks like when it's all cross-referenced.

Miscellaneous, Play, Security

What is Private Data? (1/3)

I found myself recently wondering: what exactly is private data? What I mean is, how much information about me is too much for a stranger to know?
Of course there is Personally Identifiable Information (PII), which is any data that could identify an individual, but that's not really what I mean. I'm interested in what information is sensitive for me to give away, not just something that can identify me.

This project came about when I noticed a friend of mine tearing the parcel label off a package they had received from Amazon. I asked why they were doing that and they replied "I don't want people knowing where I live". My initial thought was that this mindset is naive – surely if "people" want to figure out where you live, it wouldn't be that hard?
So I got thinking: is your address personal information?
My friend obviously seems to think so – he doesn't want people knowing where he lives. So where do you draw the line?
Full name? Address? Email address? Phone number? Job? Salary? What car you drive? Financial info?

Nowadays, most people have a digital footprint. People are certainly getting wiser about their online privacy, but definitely not everyone. A lot of people simply don't care, and why should they? As long as you don't become a victim of identity theft, and if you have nothing to hide, then what's the harm? The obvious answer is that you should be as conservative as possible about what information you leave accessible – at least, that's what most cyber professionals would say.
Personally, I'm not so sure. We live in the information age; the whole world is data driven. Amazing progress is being made every day and I quite like the idea of a digital footprint: something for me to leave behind, an insight into the lives of people that has never been available before. However, there is a limit, and I dread to think what information is gathered about me unknowingly.


The above screenshot shows my Google timeline from 5 years ago. It shows where I went that day, pretty much to the minute. That information is secure in my Google account (at least I hope so) and is not publicly accessible. What's weird is that this is just one department of one company gathering information on me, let alone the hundreds of systems I inadvertently interact with every day. Personally, I really like this feature, which is why I leave it on. I'm sure there are plenty of people out there who hate the idea of being tracked. I've also noticed Google and Facebook analysing my photos and automatically tagging people I know.

So what information do I consider sensitive? Obviously my passwords, but personally I think that's about it. If I really wanted to be private, I would not have a smartphone, social media, email, mobile banking or any online presence at all. I think as soon as you sign up to the information age, you give up your privacy. Unfortunately, many people don't realise this and perhaps share too much.

I think that my daily movements are sensitive information. I think that my salary is sensitive information. I think that my address, my email address, my job and my name cannot really be sensitive information. Personally, that's my demarcation line. I order items online, so whoever I'm ordering from will have my address. It could be a sketchy warehouse in China with their government compiling a database, or there could be a huge postal service conspiracy that stores everything that goes through it. If you send out a CV, does it not contain a lot of sensitive information (work history, education, address, mobile number) that you send to unknown "recruiters"?
The second I order things online, I'm giving my address to people I don't know. So I really do think that ripping your address off a parcel is a bit silly. If somebody was going through your bin, they are likely already at your house and would know your address. Or they are at a tip and can determine that you ordered something from Amazon – so what? I don't think that's particularly sensitive information (at least for me).
However, I do also think that this is circumstantial and depends on the individual. For example, what if you were fleeing abuse, or were a teacher, or had an unfavourable job? You certainly wouldn't want your address being plastered online, nor your email.

So if your name and address aren't really that sensitive (for the majority of people), why aren't they easily publicly available? Why can't I view who lives in a house when I zoom in on a property?
Something that I don't think is being done is trying to see how much information can be gathered by cross-referencing multiple sources of data. One source on its own might not be frightening, but combining it with numerous other sources really could lead to more information than we are comfortable with being shared.

So, harnessing my newly found knowledge of Tegola and GIS, I thought I would try to piece together some information using public and open data sources and plot it on a map of my local area. There are infinite data sources out there, but I'm going to focus just on what I can ethically mine. I've quickly realised that in order to get into the really juicy datasets, companies (Facebook, LinkedIn, Twitter) have put up paywalls to capitalise on their APIs. Unfortunately, I don't have the budget to tap into these behemoths, which is a shame, as I think it could provide some valuable (and scary) results. If I had, say, £3K of funding, I think I would be able to gather information on thousands of people in my area.

So what sources are public, open and ethical?
I'm after information on Reading, Berkshire, where I live. To start with, the Open Register lists the names and addresses of individuals who are registered to vote and agree to be put onto the Open Register. The government/council keep a closed register with everyone's name and address listed, which is used by the council and authorities.

The open register is an extract of the electoral register, but is not used for elections. It can be bought by any person, company or organisation. For example, it is used by businesses and charities to confirm name and address details. The personal data in the register must always be processed in line with data-protection legislation.
http://www.legislation.gov.uk/en/uksi/2013/3198/schedule/3/chapter/2/made

The open register is my first place to go. The official legislation says that it can be bought by anyone. However, on the Reading council website they say "only a company" may purchase a copy. After some persuasion that legally they can't deny me access, I was sent a copy after paying an administration fee.

I was sent the information for the whole of Reading and, using Golang, I opened each file and processed them into a single CSV. In the end, I got 38,742 entries of individuals and their addresses from the Open Register – so about 20% of the population of the town.

The Reading council also list their publicly available registers – https://www.reading.gov.uk/publicregisters
One of the most interesting data sources is the register of Houses in Multiple Occupation (HMO) – https://www.reading.gov.uk/media/3768/HMO-register/pdf/HMO_REG_11032020.pdf
Meaning, if your name is on there, you are a legal landlord who owns the listed property and is renting it out to multiple tenants.

As well as this, the government also list open data sources – https://data.gov.uk/
Already, there are thousands of potential sources that can be used, without even trying to touch the social media platforms.

I'll pause this post here as I think this is a nice introduction to my project. The next two posts will show some of the technical details of what I've been doing.

Development, Miscellaneous, Play

Setting Up a Tegola Server with OpenStreetMap data and custom points

This post details the steps required to run your own tile server and import data from OpenStreetMap.
I'll also show you how to import your own points data, so you can plot your own layers on the server.
This post is a prerequisite for a project I have, and I'll post more information about that as it matures.

The final result is something like this:

I googled the coordinates for Reading station and plotted a point on the map. Information about that specific location has automatically been collated from the OpenStreetMap data.

There is a myriad of tile server implementations out there, but the one that caught my eye was tegola.
There are some steps on the tegola website, but it seems that the site has fallen into a bit of disrepair. The site often references osm.tegola.com, which 404s for me, and their example clients reference OpenLayers JavaScript CDNs that also 404. However, the source code is on GitHub, so I figured worst case I could take a gander, as I'm getting pretty confident with my Go.

In all these examples, I'm just using the username/password combination of "ashley:ashley" and am just running things locally.
So you will see hardcoded references to my passwords. Don't get excited, my password isn't usually my name!

Tegola uses PostGIS, which is a Geographic Information System add-on for PostgreSQL. So the first dependency is PostgreSQL. My environment for all of this is Ubuntu 18.04 Desktop.
Install PostgreSQL 9.6 (I'm not 100% sure about newer versions).

sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main" >> /etc/apt/sources.list.d/pgdg.list'  
wget -q https://www.postgresql.org/media/keys/ACCC4CF8.asc -O - | sudo apt-key add - 
sudo apt-get update -y
sudo apt-get install postgresql-9.6 -y

Now install PostGIS

sudo apt-get install postgis

I recommend installing pgAdmin as it's a great tool for debugging PostgreSQL databases.

sudo apt-get install pgadmin4

Fire it up by running the “pgadmin4” command. It will open up a tab in a browser for you.

Complete the following steps to get the database all set up before you follow the browser wizard to get pgadmin to connect.

Create a user account that isn't just the default. This will also grant the create-database permission.

sudo -u postgres createuser ashley --createdb

Log in as the postgres user, which will have been created during installation, and update the new user's password.

sudo -u postgres psql
ALTER USER ashley WITH PASSWORD 'new_password';

Once you have a user with a password, you can log in with pgadmin.

Next we need to download some map data. The obvious source is OpenStreetMap.
This got a little bit confusing for me as there are a lot of different ways to get the data.
I didn't want the whole world, or even a country; I just really wanted a county: Berkshire.
If it's a very small area (a small town), you can use OpenStreetMap's export tool.

If that doesn't work, they list a couple of options on the export tool site. I used the Geofabrik website as the area I wanted was too big for the online tool.
Alternatively you can download the whole planet or a country, but you'll need a mega PC to use that properly, I suspect.

Geofabrik take regular pulls of the data and group it into manageable sizes. Here is the data for England:
https://download.geofabrik.de/europe/great-britain/england.html

I used the data for Berkshire – https://download.geofabrik.de/europe/great-britain/england/berkshire.html
The data comes in a ".osm.pbf" file and there are a number of tools out there to convert it into a format for PostgreSQL.
I'll save you some time – use imposm3 – https://github.com/omniscale/imposm3/releases

Download imposm3, extract the contents and add the executable to your path.

Next, create two databases. One called “gis” which will hold all of the OpenStreetMap data and another called “natural_earth” which will hold the vector graphics for actually rendering the data.

createdb gis
createdb natural_earth

Now if PostGIS is installed properly, you will be able to add some extensions to these databases.

sudo -u postgres psql -d gis -c 'CREATE EXTENSION postgis; CREATE EXTENSION hstore;'
sudo -u postgres psql -d natural_earth -c 'CREATE EXTENSION postgis; CREATE EXTENSION hstore;'

You may have to run the “CREATE EXTENSION” commands individually for each database.

Next, download this imposm3.json file and save it as mapping.json – https://raw.githubusercontent.com/go-spatial/tegola-osm/master/imposm3.json

Import the OSM data into the gis database. Make sure that the `postgis` and `hstore` extensions are enabled on the `gis` database.

imposm import -connection postgis://ashley:ashley@127.0.0.1/gis -mapping mapping.json -read berkshire-latest.osm.pbf -write
imposm import -connection postgis://ashley:ashley@127.0.0.1/gis -mapping mapping.json -deployproduction

If you have trouble with the above, you can try using osm2pgsql – https://github.com/openstreetmap/osm2pgsql#usage
Download a tegola executable – https://github.com/go-spatial/tegola/releases
Pop it in a suitable location (e.g. /home/ashley/Desktop/tile_stuff)

Install the following prereqs

sudo apt-get install -y curl
sudo add-apt-repository ppa:ubuntugis/ppa
sudo apt-get update
sudo apt-get install -y gdal-bin

Download and run this script – https://raw.githubusercontent.com/go-spatial/tegola-osm/master/osm_land.sh

Download postgis_helpers.sql and install

wget https://raw.githubusercontent.com/go-spatial/tegola-osm/master/postgis_helpers.sql
psql -U ashley -d gis -a -f postgis_helpers.sql

Download postgis_index.sql and install

wget https://raw.githubusercontent.com/go-spatial/tegola-osm/master/postgis_index.sql
psql -U ashley -d gis -a -f postgis_index.sql

Download https://raw.githubusercontent.com/go-spatial/tegola-osm/master/natural_earth.sh and run it. It will download the natural_earth vectors and put them in the `natural_earth` database, so that stuff actually renders. It takes a little while to download and install.

Next we need to create a .toml file which tegola will use. Use the following .toml file, which has some of the layers removed. The original can be found here: https://raw.githubusercontent.com/go-spatial/tegola-osm/master/tegola.toml
Obviously change the two credential blocks to match your own credentials. Save the file as "config.toml". In this config file I've stripped out the amenity data and other data that isn't really that interesting. You can use the original file as a reference for other provider data.

[webserver]
port =  ":8080"

[[providers]]
name = "osm"
type = "postgis"
host = "localhost"
port = 5432
database = "gis"
user = "ashley"
password = "ashley"

	[[providers.layers]]
	name = "land_8-20"
	geometry_fieldname = "geometry"
	id_fieldname = "ogc_fid"
	sql = "SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid FROM land_polygons WHERE wkb_geometry && !BBOX!"

	# Water
	[[providers.layers]]
	name = "water_areas"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, type, area FROM osm_water_areas WHERE type IN ('water', 'pond', 'basin', 'canal', 'mill_pond', 'riverbank', 'dock') AND geometry && !BBOX!"

	[[providers.layers]]
	name = "water_areas_gen0"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, type, area FROM osm_water_areas_gen0 WHERE type IN ('water', 'pond', 'basin', 'canal', 'mill_pond', 'riverbank') AND area > 1000000000 AND geometry && !BBOX!"

	[[providers.layers]]
	name = "water_areas_gen0_6"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, type, area FROM osm_water_areas_gen0 WHERE type IN ('water', 'pond', 'basin', 'canal', 'mill_pond', 'riverbank') AND area > 100000000 AND geometry && !BBOX!"

	[[providers.layers]]
	name = "water_areas_gen1"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, type, area FROM osm_water_areas_gen1 WHERE type IN ('water', 'pond', 'basin', 'canal', 'mill_pond', 'riverbank') AND area > 1000 AND geometry && !BBOX!"

	[[providers.layers]]
	name = "water_lines"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, type FROM osm_water_lines WHERE type IN ('river', 'canal', 'stream', 'ditch', 'drain', 'dam') AND geometry && !BBOX!"

	[[providers.layers]]
	name = "water_lines_gen0"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, type FROM osm_water_lines_gen0 WHERE type IN ('river', 'canal') AND geometry && !BBOX!"

	[[providers.layers]]
	name = "water_lines_gen1"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, type FROM osm_water_lines_gen1 WHERE type IN ('river', 'canal', 'stream', 'ditch', 'drain', 'dam') AND geometry && !BBOX!"

	# Land Use
	[[providers.layers]]
	name = "landuse_areas"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, class, type, area FROM osm_landuse_areas WHERE geometry && !BBOX!"

	[[providers.layers]]
	name = "landuse_areas_gen0"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, class, type, area FROM osm_landuse_areas_gen0 WHERE type IN ('forest','wood','nature reserve', 'nature_reserve', 'military') AND area > 1000000000 AND geometry && !BBOX!"

	[[providers.layers]]
	name = "landuse_areas_gen0_6"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, class, type, area FROM osm_landuse_areas_gen0 WHERE type IN ('forest','wood','nature reserve', 'nature_reserve', 'military') AND area > 100000000 AND geometry && !BBOX!"

	[[providers.layers]]
	name = "landuse_areas_gen1"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, class, type, area FROM osm_landuse_areas_gen1 WHERE geometry && !BBOX!"

	# Transport

	[[providers.layers]]
	name = "transport_lines_gen0"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, type, tunnel, bridge, ref FROM osm_transport_lines_gen0 WHERE type IN ('motorway','trunk','motorway_link','trunk_link','primary') AND tunnel = 0 AND bridge = 0  AND geometry && !BBOX!"

	[[providers.layers]]
	name = "transport_lines_gen1"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, ref, class, type FROM osm_transport_lines_gen1 WHERE type IN ('motorway', 'trunk', 'primary', 'primary_link', 'secondary', 'motorway_link', 'trunk_link') AND geometry && !BBOX!"

	[[providers.layers]]
	name = "transport_lines_11-12"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, ref, class, type, tunnel, bridge, access, service FROM osm_transport_lines WHERE type IN ('motorway', 'motorway_link', 'trunk', 'trunk_link', 'primary', 'primary_link', 'secondary', 'secondary_link', 'tertiary', 'tertiary_link', 'rail', 'taxiway', 'runway', 'apron') AND geometry && !BBOX!"

	[[providers.layers]]
	name = "transport_lines_13"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, ref, class, type, tunnel, bridge, access, service FROM osm_transport_lines WHERE type IN ('motorway', 'motorway_link', 'trunk', 'trunk_link', 'primary', 'primary_link', 'secondary', 'secondary_link', 'tertiary', 'tertiary_link', 'rail', 'residential', 'taxiway', 'runway', 'apron') AND geometry && !BBOX!"

	[[providers.layers]]
	name = "transport_lines_14-20"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, ref, class, type, tunnel, bridge, access, service FROM osm_transport_lines WHERE geometry && !BBOX!"

	# Buildings
	[[providers.layers]]
	name = "buildings"
	geometry_fieldname = "geometry"
	id_fieldname = "osm_id"
	sql = "SELECT ST_AsBinary(geometry) AS geometry, osm_id, name, nullif(as_numeric(height),-1) AS height, type FROM osm_buildings WHERE geometry && !BBOX!"

#	Natural Earth
[[providers]]
name = "ne"
type = "postgis"
host = "localhost"
port = 5432
database = "natural_earth"
user = "ashley"
password = "ashley"

	[[providers.layers]]
	name = "ne_110m_land"
	geometry_fieldname = "geometry"
	id_fieldname = "ogc_fid"
	sql = "SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid, featurecla, min_zoom FROM ne_110m_land WHERE wkb_geometry && !BBOX!"

	[[providers.layers]]
	name = "ne_50m_land"
	geometry_fieldname = "geometry"
	id_fieldname = "ogc_fid"
	sql = "SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid, featurecla, min_zoom FROM ne_50m_land WHERE wkb_geometry && !BBOX!"

	[[providers.layers]]
	name = "ne_10m_land"
	geometry_fieldname = "geometry"
	id_fieldname = "ogc_fid"
	sql = "SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid, featurecla, min_zoom FROM ne_10m_land WHERE wkb_geometry && !BBOX!"

	[[providers.layers]]
	name = "ne_10m_roads_3"
	geometry_fieldname = "geometry"
	id_fieldname = "ogc_fid"
	sql = "SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid, name, min_zoom, min_label, type, label FROM ne_10m_roads WHERE min_zoom < 5 AND type <> 'Ferry Route' AND wkb_geometry && !BBOX!"

	[[providers.layers]]
	name = "ne_10m_roads_5"
	geometry_fieldname = "geometry"
	id_fieldname = "ogc_fid"
	sql = "SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid, name, min_zoom, min_label, type, label FROM ne_10m_roads WHERE min_zoom <= 7  AND type <> 'Ferry Route' AND wkb_geometry && !BBOX!"


[[maps]]
name = "osm"
attribution = "OpenStreetMap" 
center = [-0.98403, 51.44507, 13.0] #Reading

	# Land Polygons
	[[maps.layers]]
	name = "land"
	provider_layer = "ne.ne_110m_land"
	min_zoom = 0
	max_zoom = 2

	[[maps.layers]]
	name = "land"
	provider_layer = "ne.ne_50m_land"
	min_zoom = 3
	max_zoom = 4

	[[maps.layers]]
	name = "land"
	provider_layer = "ne.ne_10m_land"
	min_zoom = 5
	max_zoom = 7

	[[maps.layers]]
	name = "land"
	provider_layer = "osm.land_8-20"
	dont_simplify = true
	min_zoom = 8
	max_zoom = 20

	# Land Use
	[[maps.layers]]
	name = "landuse_areas"
	provider_layer = "osm.landuse_areas_gen0"
	min_zoom = 3
	max_zoom = 5

	[[maps.layers]]
	name = "landuse_areas"
	provider_layer = "osm.landuse_areas_gen0_6"
	min_zoom = 6
	max_zoom = 9

	[[maps.layers]]
	name = "landuse_areas"
	provider_layer = "osm.landuse_areas_gen1"
	min_zoom = 10
	max_zoom = 12

	[[maps.layers]]
	name = "landuse_areas"
	provider_layer = "osm.landuse_areas"
	min_zoom = 13
	max_zoom = 20

	# Water Areas
	[[maps.layers]]
	name = "water_areas"
	provider_layer = "osm.water_areas_gen0"
	min_zoom = 3
	max_zoom = 5

	[[maps.layers]]
	name = "water_areas"
	provider_layer = "osm.water_areas_gen0_6"
	min_zoom = 6
	max_zoom = 9

	[[maps.layers]]
	name = "water_areas"
	provider_layer = "osm.water_areas_gen1"
	min_zoom = 10
	max_zoom = 12

	[[maps.layers]]
	name = "water_areas"
	provider_layer = "osm.water_areas"
	min_zoom = 13
	max_zoom = 20

	# Water Lines
	[[maps.layers]]
	name = "water_lines"
	provider_layer = "osm.water_lines_gen0"
	min_zoom = 8
	max_zoom = 12

	[[maps.layers]]
	name = "water_lines"
	provider_layer = "osm.water_lines_gen1"
	min_zoom = 13
	max_zoom = 14

	[[maps.layers]]
	name = "water_lines"
	provider_layer = "osm.water_lines"
	min_zoom = 15
	max_zoom = 20

	# Transport Lines (Roads, Rail, Aviation)
	[[maps.layers]]
	name = "transport_lines"
	provider_layer = "ne.ne_10m_roads_3"
	min_zoom = 3
	max_zoom = 4

	[[maps.layers]]
	name = "transport_lines"
	provider_layer = "ne.ne_10m_roads_5"
	min_zoom = 5
	max_zoom = 6

	[[maps.layers]]
	name = "transport_lines"
	provider_layer = "osm.transport_lines_gen0"
	min_zoom = 7
	max_zoom = 8

	[[maps.layers]]
	name = "transport_lines"
	provider_layer = "osm.transport_lines_gen1"
	min_zoom = 9
	max_zoom = 10

	[[maps.layers]]
	name = "transport_lines"
	provider_layer = "osm.transport_lines_11-12"
	min_zoom = 11
	max_zoom = 12

	[[maps.layers]]
	name = "transport_lines"
	provider_layer = "osm.transport_lines_13"
	min_zoom = 13
	max_zoom = 13

	[[maps.layers]]
	name = "transport_lines"
	provider_layer = "osm.transport_lines_14-20"
	min_zoom = 14
	max_zoom = 20

	# Buildings
	[[maps.layers]]
	name = "buildings"
	provider_layer = "osm.buildings"
	min_zoom = 14
	max_zoom = 20

Then start the tegola server up with:

./tegola serve --config=config.toml

If you navigate to localhost:8080 in a browser, you will see Reading being rendered.

Note – you won’t have the example layer yet, that’s coming now….

PostGIS uses “geometry” data rather than just latitude and longitude, so any data you want to plot will need a “geometry” value. However, there is a nifty way to derive the geometry data from lat/long coordinates.

The aim here is to create a database and table, figure out the schema, manually insert some data and figure out the .toml config to get it to render.
Firstly, create a database. I’m calling mine “public_data”, which is a reference to a future project.

createdb public_data

Add the extensions to this database

sudo -u postgres psql -d public_data -c 'CREATE EXTENSION postgis; CREATE EXTENSION hstore;'

Log into the `public_data` database and create a table.

sudo -u postgres psql public_data

Copy this SQL to create a table called “example”. Notice that the geom field is of type geometry, which is made available to PostgreSQL via the PostGIS extension. The 4326 is the SRID for WGS 84 (plain latitude/longitude coordinates), which is what OpenStreetMap uses.

CREATE TABLE example ( 
  id  INTEGER,
  name VARCHAR,
  latitude FLOAT,
  longitude FLOAT,
  geom geometry(Point, 4326),
  tags HSTORE,
  CONSTRAINT example_pky PRIMARY KEY (id)
);

Make sure that the permissions are set. A good test to check that it is working is to view the data (not just the columns) in pgAdmin.

GRANT CONNECT ON DATABASE public_data TO ashley;
GRANT ALL PRIVILEGES ON TABLE example TO ashley;

Now insert a record with some lat and long data

INSERT INTO example(id, name, latitude, longitude, tags) VALUES (1, 'reading station', 51.455893, -0.971474, '"name" => "reading station"');

Now here is the magic: once your data is in your table, you can derive the geometry using a couple of PostGIS functions. Use the following command to update the geom field for all rows in the example table.

UPDATE example SET geom = ST_MakeValid(ST_SetSRID(ST_MakePoint(longitude, latitude), 4326));

The MakeValid function may not be entirely necessary, but I figured it’s not doing any harm. ST_MakePoint can be replaced with similar functions for making lines or polygons rather than just a single point.
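
As an aside (this isn’t part of the original walkthrough), if you have more than a handful of points to load, the same insert-and-derive step can be scripted from Python. Here is a minimal sketch using psycopg2 – it assumes psycopg2-binary is installed and reuses the database, table and credentials from the examples above:

# Minimal sketch: bulk-insert lat/long points into the example table and
# derive the PostGIS geometry at insert time.
# Assumes: pip3 install psycopg2-binary
import psycopg2

points = [
    (1, "reading station", 51.455893, -0.971474),
    # (2, "another place", 51.45, -0.97),
]

conn = psycopg2.connect(dbname="public_data", user="ashley",
                        password="ashley", host="localhost")
with conn, conn.cursor() as cur:
    for pid, name, lat, lon in points:
        cur.execute(
            """
            INSERT INTO example (id, name, latitude, longitude, geom)
            VALUES (%s, %s, %s, %s,
                    ST_SetSRID(ST_MakePoint(%s, %s), 4326))
            """,
            (pid, name, lat, lon, lon, lat),  # note: ST_MakePoint takes (lon, lat)
        )
conn.close()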

Back to doing it by hand – here is a little terminal dump which shows the data being created. Note this was taken before I had inserted the hstore (tags) data, but you get the idea.

public_data=# INSERT INTO example(id, name, latitude, longitude) VALUES (1, 'reading station', 51.455893, -0.971474);
INSERT 0 1
public_data=# SELECT * FROM example;
 id |      name       | latitude  | longitude | geom 
----+-----------------+-----------+-----------+------
  1 | reading station | 51.455893 | -0.971474 | 
(1 row)

public_data=# UPDATE example SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);
UPDATE 1
public_data=# SELECT * FROM example;
 id |      name       | latitude  | longitude |                        geom                        
----+-----------------+-----------+-----------+----------------------------------------------------
  1 | reading station | 51.455893 | -0.971474 | 0101000020E6100000FA415DA45016EFBFD8BCAAB35ABA4940
(1 row)

Now I have the geometry value populated, I’ll see if I can get it rendered on the tegola server.
Add the following as a provider to the `config.toml` file

# custom provider
[[providers]]
name = "example"
type = "postgis"
host = "localhost"
port = 5432
database = "public_data"
user = "ashley"
password = "ashley"

	[[providers.layers]]
	name = "example_layer"
	geometry_type = "Point"
	id_fieldname = "id"
	srid = 4326
	sql = "SELECT ST_AsBinary(geom) AS geom, name, id FROM example WHERE geom && !BBOX!"

Then add a layer. Note – I spent a couple of hours trying to figure out why my point wouldn’t appear; it turns out it had, I just couldn’t see it behind the transport lines. So put this layer definition last.

#my test point
[[maps.layers]]
name = "example"
provider_layer = "example.example_layer"
min_zoom = 14
max_zoom = 20

Stop the existing tegola process and fire another up with the updated config.
You may find it easier to disable the transport lines.

You’ll notice that the minimum zoom for this point is 14, so the point won’t render unless you zoom in to level 14 or beyond. You can see what level you are currently at by looking in the URL.

Here is a GIF which shows the server in action and also shows my custom data point at Reading station. You can see in the URL that the point doesn’t get rendered unless I zoom in past level 14 – the minimum zoom level set in the layer config. In the GIF, I click on the Inspect Features button and move the cursor over my point. An overlay displays the “name” field I set for the point. It also shows other data associated with that geometry point.


The Tegola server is meant to be just that – the server. If you want customisation you use clients that use the tegola backend. Their website shows an example using OpenLayers, but the JavaScript sources are outdated and need to be updated to https://cdn.jsdelivr.net/gh/openlayers/openlayers.github.io@master/en/v6.3.1/build/ol.js. However, I’ve not really played around with that in detail as, for now, the server rendering gives me what I want.

Debug Notes

To log into postgres

sudo -u postgres psql postgres

Then you can run SQL commands. To update a user’s role:

ALTER ROLE ashley WITH CREATEDB;

To delete all rows from a table:

truncate table example;
Development, Play, Python

Twitterbot – Making Millions $$$$

So three years ago, I made a little program, and I’ve decided to finally do a little writeup on it.

I’m a big football fan and I love to place a little bet. Every January and summer I’m constantly refreshing Sky Sports to see which players are going where. One summer, I noticed that you can bet on what team you think a player will transfer to. This sparked a little idea, so I got to work realising it.

My idea was simple – beat the bookies.
My application first went to William Hill, onto the page where you can bet on a player transferring to a specific team. It grabbed the HTML and parsed out the players’ names and the options that the bookies would allow you to bet on. You can’t bet on just anything – only topical players and “likely” teams. You can’t exactly bet on Shearer to go to Bromley. (A rough sketch of this scraping step is shown below.)
Once I had a list of all the player names and their available options to bet on, I combined the player names with a set of keywords which I had stored in a text file.
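
A rough illustration of that scraping step (this is not the original source code – the URL and CSS selectors are hypothetical placeholders, and the real page would need to be inspected in the browser to find the right ones):

# Hedged sketch of the scraping step using requests + BeautifulSoup.
# All selectors and the URL below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

def get_transfer_options(url):
    """Return {player_name: [teams the bookie will let you bet on]}."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    options = {}
    for market in soup.select("div.transfer-market"):        # hypothetical selector
        player = market.select_one("h3.player").get_text(strip=True)  # hypothetical selector
        teams = [t.get_text(strip=True) for t in market.select("li.team")]
        options[player.lower()] = teams
    return options

# options = get_transfer_options("https://sports.williamhill.com/...")  # hypothetical URL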

So let’s say we had “rooney”: I would combine the string “rooney” with a number of my keywords. For example, “rooney spotted”, “rooney likely”, “rooney seen”, “rooney go” and so on. I would do this for every player and would then search the Twitter stream for somebody tweeting about this magic combination.
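
The keyword-combination step could look something like this (again an illustrative sketch rather than the original code, assuming a keywords.txt file with one keyword per line):

# Build "player + keyword" search phrases from a keyword file.
def build_search_phrases(players, keyword_file="keywords.txt"):
    with open(keyword_file) as f:
        keywords = [line.strip() for line in f if line.strip()]
    # e.g. "rooney spotted", "rooney likely", "rooney seen", ...
    return [f"{player} {kw}" for player in players for kw in keywords]

phrases = build_search_phrases(["rooney", "shearer"])
# each phrase would then be fed into the Twitter streaming/search API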

The thought process is that Twitter is such an amazing source of information. I thought that I could beat the bookies by getting information from first-hand witnesses (fans, journalists, agents etc…). So if a fan spotted Rooney at Newcastle airport, there would be a good chance that Rooney was on his way to sign for Newcastle. The app only ran during the transfer windows, so it’s unlikely that Rooney would be in Newcastle just for a visit.

Now the bookies don’t exactly have an API that allows you to place a bet. Why would they? People would just use it to undercut them. Betfair does have something, but you have to pay a couple of hundred pounds to gain access to it, so it wasn’t worth it for my little project.
So what I did was use Selenium (primarily a website testing automation tool) to fire up a programmable browser, navigate to the betting page, automatically log in, find the Rooney and Newcastle option on the page and place a 5p bet.
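
In Selenium terms, that flow is roughly the following. This is a hedged sketch, not the original code – every URL and selector here is a hypothetical placeholder that would have to be worked out against the real page:

# Rough Selenium sketch of the bet-placing step. All element IDs, the CSS
# selector and the URLs are hypothetical placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

def place_bet(login_url, bet_page_url, username, password, stake="0.05"):
    driver = webdriver.Firefox()  # requires geckodriver on the PATH
    try:
        # log in
        driver.get(login_url)
        driver.find_element(By.ID, "username").send_keys(username)   # hypothetical ID
        driver.find_element(By.ID, "password").send_keys(password)   # hypothetical ID
        driver.find_element(By.ID, "login-button").click()           # hypothetical ID

        # find the selection and place a small stake
        driver.get(bet_page_url)
        driver.find_element(By.CSS_SELECTOR, ".selection-rooney-newcastle").click()  # hypothetical selector
        driver.find_element(By.ID, "stake").send_keys(stake)          # hypothetical ID
        driver.find_element(By.ID, "place-bet").click()               # hypothetical ID
    finally:
        driver.quit()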

My first iteration of this app didn’t quite go to plan. I put £10 credit into my account, left the app running overnight and when I checked it in the morning I had £0 left in my account! It turns out I had completely forgotten about retweets – derp!
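
A simple guard like this would have helped avoid that (illustrative only – Twitter’s v1.1 tweet JSON marks retweets with a retweeted_status field, and manual retweets usually start with “RT @”):

# Skip retweets so the same rumour doesn't trigger a bet over and over.
def is_retweet(tweet):
    return "retweeted_status" in tweet or tweet.get("text", "").startswith("RT @")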

I won’t post the source code – it was just strung together and it has my Twitter API key hardcoded. Plus it’s 3 years old, so I doubt it works anymore.

However, here is a video of this bad boy in action:

In that video, I don’t put any credit in my account, but you can see that the actual process is entirely automated – I don’t move my cursor once the bet is placed. Unfortunately I made this 3 years ago, so I don’t have any better footage. I ended up running it on a Raspberry Pi for a couple of weeks.

I think that there is some serious merit in this idea; I imagine that the bookies have something similar in place for deciding their odds. If they don’t, they really should. I think that if I put some serious effort into this app it could actually work quite nicely. You could even narrow the sources down to only reliable ones.

On a side note – I remember when I was testing this app I would personally tweet things like “rooney to newcastle” to test my API. As a result my Twitter feed was just awash with the same “rooney to newcastle” tweet (or subtle variations). I then got a notification that the CEO of the company I work at had followed me, and there I was just tweeting “rooney to newcastle” all the time – oops.

Obviously this wasn’t a serious attempt to make money; it was just a bit of fun exploring an idea. I think I probably lost £20 in total, because it wasn’t really fine-tuned. The app would just bet on everything, so in the end I had bets for a player to go to nearly every combination. You can’t make money if you bet on every possible option haha!

Development, Python, Security

Package Management – An overlooked security vulnerability?

As a millennial who develops software, I use other people’s libraries whenever I can to get the job done as quickly as possible. Unless it’s something particularly bespoke, there is always a library out there which does what I want to do, written better than I ever could. It’s by harnessing the collaborative, open-source communities that technology is able to advance so quickly.
Often when developing software professionally, you are under time constraints. It seems to be very rare to be given time to learn, design and develop software; often you’re just expected to get the job done as quickly as possible.

Companies that develop software are likely to employ information assurance personnel who (in my experience) do not have development experience. They may hold CISSP, but it really just teaches the basics and I don’t believe it is enough to make a competent cyber professional. The thought process is often “create a policy that blocks TCP 3389 and 22”, glossing over the actual security concerns. Policy needs to be “block remote access from public sources” and not touch specific technologies. A port can be used for anything – just because a spec says that 3389 is reserved for RDP doesn’t mean that it is. It’s trivial to proxy ports or find workarounds (Guacamole?). What’s the answer? Block everything? Monitor everything? Zero trust?
In a similar vein, assurance personnel may dictate the use of internal repository mirrors, under the illusion that it brings any security whatsoever. I strongly believe that security needs to come from the bottom up – contrary to popular belief. I know the security processes are bad, because I know how I’d get around them.

As I was developing some Python code, I realised that I really don’t have a clue what’s actually in the libraries I’m using. From a due diligence perspective, it doesn’t matter – I’m using libraries from approved internal mirrors. But I don’t control the mirrors and I don’t control the libraries.
As an experiment, I thought I would create a Python package called ‘johndoe’ and upload it to PyPI. The point is that anyone (any John Doe) can upload any Python code onto PyPI. They don’t check what’s being uploaded and they won’t. It’s not in their remit – they are a platform enabling collaboration.

I haven’t updated the package since I made it last year. The johndoe package doesn’t do much, but it doesn’t really have to. Once the package is installed (and used) it can do pretty much anything. It could literally be ransomware. The point I’m trying to make isn’t to upload a malicious package – plenty of people have already done that by uploading packages whose names are slight misspellings of popular packages (although I do think PyPI have cracked down on that). All it takes is a malicious actor to solve a problem on Stack Overflow using a dodgy package and the whole company could be ruined.
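
To show how low the bar is, a PyPI package needs little more than a setup.py. The following is a minimal, hypothetical example (not the actual johndoe code) – note that anything at module level in setup.py runs when the source distribution is built or pip-installed, before the package is ever imported:

# setup.py -- minimal, hypothetical package definition (not the real johndoe).
# Anything at module level here executes at build/install time for an sdist.
from setuptools import setup
import getpass
import platform

# Stand-in for "arbitrary code" -- this could just as easily phone home.
print(f"setup.py executing as {getpass.getuser()} on {platform.node()}")

setup(
    name="johndoe",
    version="0.0.1",
    description="Proof of concept - do not trust package names alone",
    py_modules=["johndoe"],  # assumes a johndoe.py sits next to this file
)

Publishing is then just a `python3 setup.py sdist` followed by `twine upload dist/*` – there is no review step in between.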

johndoe itself isn’t really malicious; all it does is print out some basic group policy information. There are plenty of methods of getting data out of a network. The obvious one is to just send it to a website, and even if you have ingress/egress traffic monitoring and encryption, there may be ways around that. A method that I found really interesting was data exfiltration via DNS – https://dejandayoff.com/using-dns-to-break-out-of-isolated-networks-in-a-aws-cloud-environment/
Getting information out is the easy bit; it’s getting access to the systems in the first place which is “difficult”. Poor package management is one way that seems to have slipped through a lot of security policies.

This problem is global and highlights the need for security design in development environments. This problem isn’t Python specific by any means. The answer is NOT “block all packages” – they are pivotal to modern software development.
When you next get into your “secure” development environment, try and install johndoe. I’d bet that the majority of people have no problem downloading that package. Even if your policy is to download onto an air-gapped machine and virus scan the software, it’s not going to find anything.

Miscellaneous, Play

Game Changing Shell

A colleague of mine recently introduced me to the game-changing Oh My Zsh framework for zsh.
Before:

After:

The shell has tonnes of plugins built in, as well as loads of community-driven plugins. The aesthetic themes make long hours much easier and the plugins can really improve your efficiency at the terminal.

It was a little fiddly to set up, so here is a little semi-automated writeup on how to get started.

Terminator is by far my favourite terminal. Install it via apt along with zsh and git (required for setup)

sudo apt-get install zsh terminator git -y

Download ohmyzsh

sh -c "$(wget -O- https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"

When the script has finished, just type `exit` to revert to bash.

If you need to set your default shell to zsh, do the following:

sudo chsh -s /bin/zsh $(whoami)

Download & install the Meslo Nerd Font. This includes loads of icons for stuff like Debian/Ubuntu, GitLab and GitHub.

mkdir ~/.fonts
wget https://github.com/ryanoasis/nerd-fonts/releases/download/v2.1.0/Meslo.zip -O temp.zip
unzip temp.zip -d ~/.fonts/
rm temp.zip
fc-cache -f -v

Download & install the zsh-syntax-highlighting plugin

wget https://github.com/zsh-users/zsh-syntax-highlighting/archive/master.zip -O temp.zip
unzip temp.zip
mv zsh-syntax-highlighting-master /home/$(whoami)/.oh-my-zsh/plugins/zsh-syntax-highlighting
rm temp.zip

Install powerlevel10k

git clone --depth=1 https://github.com/romkatv/powerlevel10k.git ~/.oh-my-zsh/themes/powerlevel10k

Edit ~/.zshrc and add the following

ZSH_THEME="powerlevel10k/powerlevel10k"
POWERLEVEL10K_MODE="nerdfont-complete"

Replace the plugins section (around line 72) with the following. Add more plugins as necessary.

plugins=(git dnf zsh-syntax-highlighting)

Now close the terminal and open up Terminator.
You’ll be asked a bunch of questions about customising the terminal.

The only thing I recommend is to use the “rainbow” layout and to use Unicode and not ASCII.

If you ever want to re-run the prompt to change the layout, run the following command from anywhere:

p10k configure

I use Visual Studio Code a lot, so I’ve changed the Terminator colour to the same as VSCode’s default theme. It’s #1e1e1e.
Right-click on the Terminator window and go to Preferences.
Change to the Profiles tab.
Under General, change the font to MesloLGSDZ Nerd Font Regular.
Under the Colours tab, change the built-in scheme to custom and change the colour.

If you type in an invalid command – for example “sl” instead of “ls” – the command is highlighted red instead of green. This is actually an incredibly useful feature, especially when you are developing and have a dozen different aliases set.

If you clone a git repo, you can see at a glance what branch you’re on. This is also really useful if you change into some really old clone that you forgot about.

git clone https://github.com/postmanlabs/httpbin.git


On top of this, if you type git and hit tab, it’ll show you all the possible command variants – very useful (much more so than any --help suffix).

If you want to integrate the shell into VSCode, open it up, press F1 and type `Preferences: Open Settings (JSON)`.
Then in the settings, append the following at the bottom.

    "terminal.integrated.fontFamily": "MesloLGS NF" ,
    "terminal.integrated.shell.linux": "/bin/zsh",
    "terminal.integrated.rendererType": "canvas",
    "terminal.integrated.lineHeight": 1.3

This will add the icons and the font to your VSCode terminal.

There are tonnes of plugins out there and I’m really just getting started, but I’m fully converted to zsh.

Development

MicroK8s Basic Deployment

I’m trying to learn Kubernetes so I am playing around with MicroK8s to run a deployment locally on my Ubuntu Virtual Machine.

This post describes a basic deployment of HTTPBin in a cluster of three pods, with external access to the pods. I’m very new to Kubernetes, so this is a basic post where I’ve been feeling my way to try and gain a better understanding. I hope that my simple explanation (as I’m not yet capable of explaining it in detail) helps others, as I was overwhelmed with information on the subject.

My current understanding is that each pod runs a container (Docker). Pods are mortal and may fall over. As they are running containers, that doesn’t matter – the benefit of running things in containers is that each instance is self-contained, so any dependencies and configurations are the same across multiple instances. A pod has its own private IP address within the cluster. A pod is monitored and if it falls over, it can be restarted. Pods within a cluster can talk to each other. Service discovery can be handled by Kubernetes or other add-ons such as Istio, which may add security measures like sidecar proxies.

A Deployment is a number of Pods that run the same services. So in my example, I’ll create a deployment of HTTPBin which will fire up 3 pods, each running the HTTPBin Docker container. Access to the deployment (the logical grouping of the pods) will be via a LoadBalancer. The LoadBalancer is a service which provides a public IP address. There are numerous ways of providing outside access to a deployment/pod/cluster – see https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types

It seems that a NodePort isn’t necessarily ideal for production, so I figure I should use a local load balancer, as the steps to move from development (MicroK8s) to production (AKS or similar) should then be minimal (I hope).

Luckily, MicroK8s has a load balancer add-on called “metallb” which replicates an external load balancer. The first thing to do is to get your local interface IP address. I’m still using the older Ubuntu 16.04, so I’m using ifconfig.

My IP address is 192.168.0.42 with a /24 subnet mask. When installing metallb you’ll need to provide a range of IP addresses that it can use to provide outside access into whatever Kubernetes things you configure it for. In my example, I’ll provide the IP range 192.168.0.50-192.168.0.60.

Install MicroK8s

sudo apt-get install snapd
sudo snap install microk8s --classic

Enable the metallb addon

 sudo microk8s enable metallb 

When prompted, add in the appropriate IP range. These IPs must not be in use and must be on the same network as your interface.

To use the kubectl command in MicroK8s, the full command is `sudo microk8s.kubectl`, which is a bit of a mouthful, so I like to set an alias

alias kubectl="sudo microk8s.kubectl"

Aliases are only set for that terminal session, so if you close the terminal or restart the machine you’ll lose the alias (add it to your ~/.bashrc or ~/.zshrc if you want it to persist). If you have multiple different endpoints (like on a website) you may want to add an Ingress to explicitly define routes to the services. However, this example is so basic that all you need is two configuration items: the Deployment of pods and the LoadBalancer Service. You can put them both in a YAML file (e.g. test-app.yaml) that looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app-deployment
  labels:
    app: test-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
      - name: test-app
        image: kennethreitz/httpbin
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: my-service-new
spec:
  selector:
    app: test-app
  ports:
    - protocol: TCP
      port: 80
  type: LoadBalancer

So firstly we are defining a Deployment. A Deployment is a grouping of pods serving the same thing (the httpbin container). You can see under the spec that it’s referencing the public httpbin Docker container image. This Deployment is called “test-app-deployment”, and each pod within the deployment will be running a container called “test-app”. The second half of the file describes the Service. The “selector” section of the service describes what the service will point to – in this case the pods created by the deployment (matched via the app: test-app label) rather than an individual pod. The type is LoadBalancer and not NodePort, as I want this to be as close to production as possible. The LoadBalancer service will use metallb and will set the external IP address of the service to one within my defined range. Apply the file with the following:

kubectl apply -f test-app.yaml

Test that the service now has an external IP address

kubectl get svc


Test that the service can connect to the deployment and that each instance of httpbin is presenting a valid endpoint

kubectl get endpoints

Online, many people create their services using the command line, which I find a goofy concept. I’ve seen people create NodePort services with the following:

sudo microk8s.kubectl expose deployment test-app-deployment --type=NodePort --port=80 --name=ak-service

I really don’t like this approach as I feel the infrastructure should be described in code to be repeatable.

From your main machine, try an HTTP GET to the service’s external IP, which will land on one of the pods in the deployment. I believe that by default the LoadBalancer implements a round-robin approach.

curl -X GET "http://192.168.0.50/get" -H  "accept: application/json"
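
The same check from Python, if you prefer (a small sketch assuming the service was assigned 192.168.0.50 as in the example above; httpbin’s /get endpoint echoes the request back as JSON):

# Fire a few requests at the LoadBalancer IP and print the responses.
import requests

for _ in range(5):
    r = requests.get("http://192.168.0.50/get",
                     headers={"accept": "application/json"})
    print(r.status_code, r.json().get("origin"))

# To see which pod actually served each request, check the pod logs
# with `kubectl logs <pod-name>`.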

Alternatively, if you really don’t want to use a LoadBalancer service, you can expose the service via an Ingress. With an Ingress, each rule must map to a service. On the Ingress Controller itself (not defined in any of your .yaml files) you set the default backend – where any requests that don’t match your rules will go. I think there is limited functionality with the MicroK8s ingress – it’s meant for local development. So the IP address that your ingress will get will be 127.0.0.1 (as opposed to 192.168.0.50 with the load balancer).

Make sure ingress is enabled:

sudo microk8s.enable ingress

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app-deployment
  labels:
    app: test-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
      - name: test-app
        image: kennethreitz/httpbin
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: my-service-new
spec:
  selector:
    app: test-app
  ports:
    - protocol: TCP
      port: 80
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: test-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - http:
      paths:
      #each path must have a corresponding service
      - path: /
        pathType: Prefix
        backend:
          serviceName: my-service-new
          servicePort: 80
The above snippet shows that we have the same deployment, the service references the deployment and the ingress calls the service. Notice how the “type: LoadBalancer” line is removed from the service. Enable the ingress add-on (if you haven’t already) and apply the file:

sudo microk8s.enable ingress
kubectl apply -f test-app.yaml

Give it a short while for your ingress to get an IP (localhost) and then you should be able to try a curl from your VM to http://127.0.0.1/get
Miscellaneous, Security

CISSP Reflection

I have just passed my CISSP exam (28th January). It took me around 2 hours and I had to complete the full 150 questions.
My main takeaway is that the exam is much harder than any of the practice questions. Without exaggeration, I would say that about 70% of what I studied did not appear in the exam. Often when studying, I could make educated guesses at some of the options, but in the actual exam the answers were so similar that it was often a complete gamble. That being said, a 70% pass mark on 150 questions means that you can get 45 questions wrong and still pass, which is quite a lot.

I can’t talk too much about the exam as I signed an NDA, so I’ll keep it abstract.
I read the student guide cover to cover, then I did the practice questions in the study guide. I then did all the mock exams online and was averaging about 80%. I was pretty confident going into the exam but as soon as I started, I instantly knew I was going to struggle. After the first 10 questions I did not feel that I was going to pass. I’m usually pretty confident about most things, but I instantly struggled. I would say that I only got about 3 or 4 questions that were “easy” and the rest were difficult. In the practice questions, you can usually rule out a couple of answers, but in the exam I really struggled to do that. The answers were very similar and it was much more difficult to get a feel for what they wanted the answer to be.

Coming from a technical background I really struggled with some of the questions and I definitely overthought some of the answers. I found that when doing the mock questions, you could easily argue against many of their answers, but in the exam it was much harder to do that. There is an awful lot of information about CISSP online and I made sure to only stick to official resources.
The CISSP is regarded as non-technical, but as I was studying I was really starting to doubt that. I thought it was becoming more of a memory exercise – regurgitating encryption block sizes and information security models (Bell-LaPadula, Clark-Wilson etc…). I had memorised a lot of the information and that’s what gave me a lot of confidence. As I’ve said, that confidence was ill-founded and none of those questions appeared for me. I felt that the exam itself was not representative of the study material. The exam was much less technical and more process/policy orientated, which is what I initially thought the CISSP would be like. It’s almost like the study material misled me into thinking it was more technical than it actually was.
The exam uses an adaptive testing algorithm to target areas you are weaker at, and it must have been really good because I don’t recall getting many questions about crypto/network/application stuff, which are definitely my stronger areas.
The exam is much more abstract and assessor-focused than any of the mock tests led me to believe. For example, it’s not enough to just know what the various types of access control are; I never had a question like that. They give you a fairly complicated scenario and ask which one would be best. I don’t recall having questions like that in the mocks. Nothing as simple as “what type of access control uses labels” – it would be something more like “company x wants to implement 4 different types of user controls, with so and so having this access, department x having this access”. Those types of questions were actually some of the easier ones.

Learning

I attended the bootcamp in July and I didn’t get much out of it (except for the voucher!). It reminded me that I do not learn as well in instructor-based sessions. I tend to learn by locking myself away, trawling through documentation and handwriting notes at my own pace. That’s what works for me, but others have a lot of success with video and classroom-based lessons.
After the bootcamp, I didn’t do much studying at all. It got to around Christmas time and I decided I needed to book the exam otherwise I would never do it. I booked the exam for Jan 28th, thinking it was plenty of time. My exam was at 3pm; in hindsight I wish I had booked an AM slot. I’m much less active in the afternoons as I normally start work around 6.45. I was also sort of floating around in the morning, not wanting to do last-minute study, but not wanting to lose focus, so it was just kind of dead time. It took me 6 months to make it through the student guide that was provided in the bootcamp. The majority of that was completed in December. So I spent December reading the material and then January was more revision, even though that’s when I learnt the most. I wish I hadn’t read the student guide and had read the study guide instead. I never read the study guide end to end, but it is undoubtedly much more comprehensive. A lot of the material in the mock exams isn’t covered in the student guide.
Prior to the bootcamp, I went and got the CISSP study guide. I wasn’t aware that we were going to be given any physical materials in the bootcamp. It turns out that this was one of the best things I did, as the student guide is not good enough for this exam. The only benefit it has is that it contains additional test questions. The book itself does not cover a lot of the material in the exam.

Student Guide – BAD
Study Guide – GOOD

I would highly recommend not bothering at all with the student guide and reading the study guide instead. The study guide comes with more than 1300 questions, although the questions at the end of each chapter in the book are also included in the online question bank.

Revision

Looking online, people spend months and months preparing for the exam. I think I was a little overconfident in my ability and I would say that I only started properly studying about two or three weeks before the exam.
I did the mock exams at the end of each chapter in the study guide and the student guide and then revised each wrong answer. If there was an option that I didn’t recognise, I went away and studied that option. This is a good technique, because if the same option pops up elsewhere, you’ll know what it is and can potentially select it or rule it out.
In the two weeks leading up to the exam, I spent an hour every weekday morning going through the mock exams online – pausing and going through the answers. And then on the last two weekends before the exam, I spent 3 hours each day studying. So in the last 2 weeks, I must have done around 22 hours. The time before that was just bits and bobs (with the exclusion of reading the student guide).

Exam Process

I booked my exam for 3pm, but me being me I arrived way too early. I got there around 1.45 but they were able to put me straight through, so I did my exam early.
I brought some water with me, but you’re not allowed to take anything in. This was annoying because it was quite hot in there. You have to leave everything (bar your two forms of ID) in a locker.
The room was really hot and full of people taking their driving theory tests. I found it quite distracting with people walking in and out, the voices in the room outside and the heat. It’s quiet, but not silent. They provide earplugs but I didn’t use them. I remember about 30 minutes into the exam wishing I was taking the driving theory again rather than this exam haha!
The exam is up to 150 questions, but everyone I’ve spoken to seems to have finished around 110 questions. After the first handful of questions my confidence vanished because the questions were on a different level to everything I had studied. Obviously I knew I hadn’t failed already, but I was not feeling great. As the exam approached the 100-question mark, I was pretty sure I had failed. Honestly, a lot of the questions I had to guess. I was getting very frustrated at some of the questions because the answers were so similar – much more so than in any mock/practice questions I had done. Regardless, I decided I would try and get as much out of the experience as possible. As the company was paying, I really had nothing to lose.
I purposely didn’t keep it a secret that I had my CISSP exam. I think it’s important to be open and not embarrassed by failure. Failure isn’t something to be embarrassed or scared of. I have enough confidence in myself to know that if I failed the exam, it doesn’t mean I’m an idiot. I honestly feel that I was extremely lucky to pass. Having completed the process, it’s not something I am keen to repeat. I invested a lot of time learning things I will never need. It’s not that I don’t respect the CISSP certification – it was very difficult. I just honestly do not believe that I am a better cyber professional after completing the exam.
