Looking at the Dataset from Awesome Italia Remote

Some very kind and skilled Italian folks run a website that lists companies hiring Italian workers remotely. Their super useful scouting work can be consulted freely and conveniently at italiaremote.com and, being a lovely bunch as well as kind and skilled, they also shared the data in a publicly available GitHub repository.

Let's have a look at it!

The data was fetched from the repository on 2024.04.09 and might have changed in both content and shape in the meantime. You can find the notebook I used to explore the data, built with the typical do-data-stuff-in-python tools, in this repository: FrancescoManfredi/AIRV-analysis.

What does the data look like?

The dataset is provided as a set of JSON files. Each file contains data for a single company, and for each company we have the following fields (reported here with their data types).

id	columns	type	inner_type
0	name	str	-
1	url	str	-
2	career_page_url	str	-
3	type	str	-
4	categories	list	str
5	remote_policy	str	-
6	hiring_policies	list	str
7	tags	list	str

Here is what one of these files looks like:

{
  "name": "Canonical",
  "url": "https://canonical.com/",
  "career_page_url": "https://canonical.com/careers",
  "type": "Product",
  "categories": [
    "cloud_software"
  ],
  "remote_policy": "Full",
  "hiring_policies": [
    "Contract"
  ],
  "tags": [
    "Python",
    "Go",
    "OpenStack",
    "Kubernetes"
  ]
}
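A minimal sketch of how files shaped like the one above could be read into memory with the standard library alone. The function name `load_companies` and the idea that all files sit in one local folder are assumptions made for illustration; the notebook may well use pandas instead.

```python
import json
from pathlib import Path


def load_companies(data_dir):
    """Read every *.json file in data_dir into a list of dicts (one per company)."""
    companies = []
    # sorted() keeps the order stable across runs and operating systems
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            companies.append(json.load(fh))
    return companies
```

Calling `load_companies` on a local checkout of the data folder would return one dictionary per company, with the eight fields listed in the table above.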

Total companies and duplicates

metric	name	career_page_url
count	340	340
unique	339	338

At the time of writing there are 340 companies in total. There were only 339 unique company names among the 340 files, but it turned out that one of the companies (fastloop) was simply named incorrectly.
The career_page_url field also shows only 338 unique values, which means some companies share the same value. In fact, the following two pairs of companies have some sort of close relationship and share the same career page:

id	name	url	career_page_url
146	HACKERSGEN	https://hackersgen.com	https://www.sorint.com/en/careers/
148	Docplanner	https://www.docplanner.com	https://www.docplanner.com/career
155	SORINT.lab	https://www.sorint.com/en/	https://www.sorint.com/en/careers/
225	Gipo - Ianiri Informatica	https://www.gipo.it	https://www.docplanner.com/career
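A check like the one above can be sketched by grouping company names by the value of a field and keeping only the groups with more than one member. The helper name `shared_values` is made up for this example; the four records are the ones from the table.

```python
from collections import defaultdict


def shared_values(companies, field):
    """Return {value: [company names]} for every value of `field` used more than once."""
    groups = defaultdict(list)
    for company in companies:
        groups[company.get(field)].append(company["name"])
    return {value: names for value, names in groups.items() if len(names) > 1}


# The four rows shown in the table above
rows = [
    {"name": "HACKERSGEN", "career_page_url": "https://www.sorint.com/en/careers/"},
    {"name": "Docplanner", "career_page_url": "https://www.docplanner.com/career"},
    {"name": "SORINT.lab", "career_page_url": "https://www.sorint.com/en/careers/"},
    {"name": "Gipo - Ianiri Informatica", "career_page_url": "https://www.docplanner.com/career"},
]

duplicates = shared_values(rows, "career_page_url")
for url, names in duplicates.items():
    print(url, names)
```

The same helper works for the name field, which is how a duplicated company name would surface.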

What are the business models of these companies?

The type assigned to a company puts it in one of four categories based on its primary business model or clientele. These could be "Product", "Consulting", "B2B" or "B2C".

company_types

There is a strong prevalence of product companies. Four of the companies do not have a type.
Only one company has more than one type: Consulting and B2B.
Since it is the only one out of 340 companies, and considering the field is encoded as a string instead of a list, one might suspect this is an error; on the other hand, it makes perfect sense for a company to fall into several of these categories at the same time.
I'll just leave the data alone in this case.

Which remote policies do these companies apply?

The README in the original repository provides the following descriptions for the three values for the remote policy (verbatim):

Full: Company doesn't have physical offices, so you'll always work remotely;
Hybrid: Company allows remote but only for some days;
Optional: Company allows you to choose when work remotely or in office, but can ask you to go sometimes.

The distribution looks like this:

remote_policy

Considering the dataset exists specifically to track companies that allow remote work, it is no surprise that "Hybrid", the "least permissive" option, is also the least frequent.

Which tech stacks and skills do these companies need?

There is no explicit description of what exactly the "tags" field represents. In the official repository the list of tags appears under the "Stack" header in a table. Nonetheless, based on the observed values, it looks like this field is also (somewhat less frequently) used to capture other aspects, such as required skills or knowledge of some domains (e.g. Big Data, Business Marketing, fintech).

The available tags were extremely inhomogeneous: many tags referring to the same concept were rendered with different casing and overall style. To make a somewhat meaningful analysis possible, I did some normalization work that resulted in a mapping from original to normalized forms (available here as a python dict).
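Applying such a mapping could look roughly like the sketch below. The entries in `TAG_MAP` are hypothetical examples invented for illustration, not the real mapping from the repository, and `normalize_tags` is a made-up helper name.

```python
# HYPOTHETICAL excerpt: these entries only illustrate the idea,
# the real original-to-normalized mapping lives in the analysis repository.
TAG_MAP = {
    "golang": "Go",
    "GO": "Go",
    "python": "Python",
    "k8s": "Kubernetes",
}


def normalize_tags(tags, mapping=TAG_MAP):
    """Map each raw tag to its normalized form, dropping duplicates but keeping order."""
    normalized = []
    for tag in tags:
        tag = tag.strip()
        # Tags missing from the mapping are kept unchanged
        norm = mapping.get(tag, tag)
        if norm not in normalized:
            normalized.append(norm)
    return normalized


print(normalize_tags(["golang", "GO", " Python ", "Rust"]))
```

Tags that are not in the dict pass through untouched, so the mapping only needs to list the variants that actually have to change.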

tags_wordcloud
Yay! Wordclouds!

top25_tags
Something a little more serious

In which sectors do these companies operate?

There is no explicit description for the category field, but based on the available options it represents the sectors in which a company operates. Here is the distribution:

category	count	proportion
design_ux	45	0.132353
cloud_software	298	0.876471
hr	12	0.0352941
cybersecurity	12	0.0352941
marketing_writing	45	0.132353
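Since a company can list several categories, the proportions in the table appear to be computed against the number of companies (298 / 340 ≈ 0.876), so they do not have to add up to 1. A rough sketch of that computation follows; the helper name `value_distribution` and the toy records are assumptions for illustration only.

```python
from collections import Counter


def value_distribution(companies, field):
    """Count each value of a list-valued field and relate it to the number of companies."""
    counts = Counter()
    for company in companies:
        counts.update(company.get(field, []))
    n = len(companies)
    # proportion = share of companies listing that value (may sum to more than 1)
    return {value: (count, count / n) for value, count in counts.items()}


# Toy records, NOT taken from the dataset
toy = [
    {"categories": ["cloud_software"]},
    {"categories": ["cloud_software", "hr"]},
    {"categories": ["design_ux", "marketing_writing"]},
    {"categories": ["cloud_software"]},
]

for category, (count, proportion) in value_distribution(toy, "categories").items():
    print(category, count, round(proportion, 3))
```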

What are the hiring policies of these companies?

The README in the original repository gives the following descriptions for each possible hiring policy:

Direct: Company is hiring directly with a legal entity in Italy;
Contract: Company is hiring contractors in Italy, VAT Number is required;
Intermediary: Company is hiring using a payroll intermediary in Italy.

In pictures:

hiring_policies

In numbers:

hiring policy	count	proportion
unknown	37	0.108824
Contract	55	0.161765
Intermediary	12	0.0352941
Direct	254	0.747059

A couple of curiosities

Now that I've looked at the data I've got a couple of questions:

Are companies in certain sectors (as in the category field) more prone to work in multiple sectors? Or, to put it another way: is it more common for companies doing marketing and writing to only do marketing and writing? What about cloud & software?
What do you need to work in a cybersecurity company? Can the tags help us form an idea?

Do companies in certain sectors have a higher tendency to work in multiple sectors?

To answer this question we can perform a chi-square test of independence, in which the null hypothesis is that the two variables we are looking at are independent and the observed values can be fully explained by random chance.
In other words, we are going to perform a statistical test to check, for each sector, whether the proportion of companies that have more than one category is different enough from the overall proportion that it would be fairly rare to happen by chance. If we find this is the case, we can assume that there is, in fact, some sort of relationship between working in a sector and working in more than one sector.

First we need to compute a contingency table for categories and "having-multiple-category-ness", then we'll feed it into a python function that performs the chi-square test. This function will provide us with a p-value, a statistic and an expected distribution under the null hypothesis.
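The two steps can be sketched in plain Python as below. This is an illustration under assumptions, not the notebook's code: I am assuming that a company is counted once for every category it lists, and the helper names are made up. Under the null hypothesis each expected count is row total × column total / grand total, and the statistic is the sum of (observed − expected)² / expected over all cells.

```python
def contingency_table(companies):
    """Rows: category. Columns: [companies with one category, companies with several]."""
    table = {}
    for company in companies:
        categories = company.get("categories", [])
        column = 1 if len(categories) > 1 else 0
        # ASSUMPTION: a multi-category company contributes to each of its categories
        for category in categories:
            table.setdefault(category, [0, 0])[column] += 1
    return table


def chi_square(table):
    """Return (statistic, degrees of freedom, expected counts) for a {row: [counts]} table."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    total = sum(row_totals)
    expected = [[rt * ct / total for ct in col_totals] for rt in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for row, exp_row in zip(rows, expected)
        for obs, exp in zip(row, exp_row)
    )
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Toy table, NOT the real data
statistic, dof, expected = chi_square({"a": [10, 20], "b": [20, 10]})
print(round(statistic, 4), dof, expected)
```

The sketch stops at the statistic. A library routine such as `scipy.stats.chi2_contingency` also returns the p-value along with the statistic, the degrees of freedom and the expected frequencies.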

contingency_vs_expected

The figure above shows the difference between what we observed and what we would have observed if there were absolutely no relationship between the two variables. Another, more graphical, way to show this discrepancy is the following:

displacement
Is this clearer? I don't know, but it looks cool!

The test also gave a p-value = 3.7616354613201274e-26 (basically 0), which would make us reject the null hypothesis of no effect for any sensible choice of significance level alpha.

By observing the directions of the displacements with regard to the expected split, we can say that Cloud & Software companies have a tendency to only work in cloud and software, while companies in other categories have a tendency to diversify more. It is worth mentioning that this might be the result of "Cloud & Software" being a very broad line of work that could be split into lots of sub-categories, while the other four are quite well defined.
A different possible explanation is that the dataset was collected and is maintained by IT professionals, so the other categories might have been of little interest to them from the start.

What do you need to work in cybersecurity companies?

To answer this question we can look at the tags for the companies in the cybersecurity category. We will ignore companies in that category that also appear in a different category, as we would have no way to know which category each tag refers to.

tags	count
Penetration Testing	2
Hardware Security	2
ACSIA	1
Vulnerability Assessment	1
IT Managed Services	1
Digital Transformation	1
Security Research	1
Application Security	1
Red Teaming	1
Rust	1
Go	1
Solidity	1
Blockchain	1
Cryptography	1
Distributed Systems	1
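A table like the one above could be produced along these lines: keep only the companies whose single category is the one of interest, then count their tags. The helper name and the toy records are assumptions for illustration, not the notebook's code or the real data.

```python
from collections import Counter


def tags_for_exclusive_category(companies, category):
    """Count tags across companies whose ONLY category is `category`."""
    exclusive = [c for c in companies if c.get("categories") == [category]]
    return Counter(tag for c in exclusive for tag in c.get("tags", []))


# Toy records, NOT taken from the dataset
toy_companies = [
    {"categories": ["cybersecurity"], "tags": ["Penetration Testing", "Rust"]},
    {"categories": ["cybersecurity"], "tags": ["Penetration Testing"]},
    # Ignored: it also belongs to another category
    {"categories": ["cybersecurity", "cloud_software"], "tags": ["Python"]},
]

tag_counts = tags_for_exclusive_category(toy_companies, "cybersecurity")
for tag, count in tag_counts.most_common():
    print(tag, count)
```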

While seeing this list of tags might be somewhat useful, we cannot say the same for the frequencies. The fact is that by filtering for companies in cybersecurity alone we ended up with only 4 companies (4Securitas, Infor srl SB, Shielder, Codezen).
Some things come to mind when looking at these tags:

Penetration testing was kind of expected to be there;
Hardware security? I did not think about that!
Should Red Teaming and Vulnerability Assessment be considered the same as Penetration Testing? I don't know.
What is ACSIA? Apparently it is an Italian-developed cybersecurity solution.
I would have expected Python to appear at least once for its versatility but instead we see Rust and Go in the list.

It's been a nice little exercise with an interesting dataset.
Here are some links related to this post:

italiaremote.com, the website where you can search and consult the data with a convenient and clear web interface;
Awesome Italia Remote, the repository for the data by the original authors;
My repository, containing the analysis work (see the workbench.ipynb notebook).

For questions or comments feel free to send me an email or other message (see About Me).