# Sample names

**by [Matt Hall](https://github.com/kwinkunks)**

You have a set of sample names. They look like this:

    001235_Ainsa_Sobrarbe_C_2016-04-20_PCx
    ^^^^^^ ^^^^^ ^^^^^^^^ ^ ^^^^^^^^^^ ^^^
      1      2      3     4      5      6

A **valid name** consists of 6 parts separated by underscores. The parts are underlined and numbered above. Having 6 parts is enough for a name to be called 'valid', even if individual parts are misspelt, badly formatted or inconsistent with other names.

The 6 parts are:

1. **Unique identifier** consisting of 6 characters.
2. **Basin name.** Note that spellings are not guaranteed to be correct.
3. **Unit or Formation name.** Note that spellings are not guaranteed to be correct.
4. **Specimen type**, either H or C (hand or core).
5. **Date**, which must be in ISO 8601 YYYY-MM-DD format to be considered correct.
6. **Preparation codes** of at least one character.
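
As a minimal sketch of the validity rule, one way to split a name into its 6 parts is shown here. The `Sample` class, its field names and the `parse_name` function are hypothetical names chosen for illustration. The sketch also assumes that an empty part still counts as a part, which the description does not spell out.

```python
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    """The 6 parts of a sample name (field names are illustrative)."""
    uid: str
    basin: str
    unit: str
    specimen: str
    date: str
    prep: str


def parse_name(name: str) -> Optional[Sample]:
    """Return the 6 parts of a name, or None if it does not have exactly 6."""
    parts = name.strip().split('_')
    if len(parts) != 6:
        return None
    return Sample(*parts)


good = parse_name('001235_Ainsa_Sobrarbe_C_2016-04-20_PCx')
bad = parse_name('001238_Sorbas_Gochar_2017-06-03_PxM')  # No specimen type.

print(good.basin, good.date)  # Ainsa 2016-04-20
print(bad)                    # None
```

Note that this only checks the number of parts. It deliberately does not check whether each part is correct, because a name can be 'valid' without being correct.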

We need to extract some information from this dataset.

1. How many valid sample names are there?
2. How many valid samples were taken in the Ainsa basin? Include records with misspelt basin names.
3. What's the longest period, in days, with no valid samples taken in Ainsa?

When looking for misspellings, we'll assume that two words are the same word if they have the same first letter, the same last letter, and the same middle letters in any order. So 'Anisa' is a misspelling of 'Ainsa', but 'Aimsa' is not. We'll also assume that the spelling with the most occurrences is the correct spelling.
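
The scrambled-letters rule can be sketched by reducing each word to a key: its first letter, its middle letters in sorted order, and its last letter. Two words with the same key are then treated as the same word. The names `scramble_key`, `same_word` and `canonical` are hypothetical, and the comparison here is case-sensitive, which is an assumption.

```python
from collections import Counter


def scramble_key(word: str) -> tuple:
    """First letter, sorted middle letters, last letter."""
    return (word[:1], ''.join(sorted(word[1:-1])), word[-1:])


def same_word(a: str, b: str) -> bool:
    """True if the two words differ only by scrambled middle letters."""
    return scramble_key(a) == scramble_key(b)


print(same_word('Ainsa', 'Anisa'))  # True
print(same_word('Ainsa', 'Aimsa'))  # False

# Count the spellings within each group, then take the most common
# spelling in each group as the correct one.
basins = ['Ainsa', 'Ainsa', 'Anisa', 'Sorbas']
groups = {}
for basin in basins:
    groups.setdefault(scramble_key(basin), Counter())[basin] += 1

canonical = {key: counts.most_common(1)[0][0] for key, counts in groups.items()}
print(canonical[scramble_key('Anisa')])  # Ainsa
```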


## Example

Here's some sample data:

    001235_Ainsa_Sobrarbe_C_2016-04-20_PCx
    001236_Ainsa_Sobrarbe_H_2016-04-21_P
    001237_Anisa_Sobrarbe_H_2016-04-29_TCx
    001238_Sorbas_Gochar_2017-06-03_PxM
    001238_Sorbas_Gochar_C_2017-06-03_PxM
    001240_SORBAS_Gochar_C_2017-06-03_PxM

Let's answer the 3 questions for this sample dataset:

- There are **5** valid names (and 1 invalid one, with no specimen type).
- The Ainsa Basin appears in **3** valid sample names (including 1 misspelling, 'Anisa').
- The longest period with no samples taken in Ainsa is **7** days, between 21 April and 29 April 2016.


## Hints

It's likely that the `datetime` library will be useful in answering question 3. In particular, this code is useful:

    from datetime import datetime
    datetime.fromisoformat('2016-07-03')

If that command raises a `ValueError` on a date, then you should consider the date format incorrect and ignore that record.
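
Here is a rough sketch of how the date check and the gap calculation might fit together. The function names `parse_date` and `longest_gap` are hypothetical. The convention of counting only the days strictly between two sample dates is inferred from the example above, where 21 April to 29 April gives 7 days. Be aware that newer Python versions (3.11 and later) accept more ISO 8601 forms in `fromisoformat` than the plain YYYY-MM-DD format.

```python
from datetime import datetime


def parse_date(text):
    """Return a datetime, or None if the text cannot be parsed."""
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


def longest_gap(date_strings):
    """Longest run of whole days strictly between consecutive dates."""
    parsed = (parse_date(s) for s in date_strings)
    dates = sorted({d for d in parsed if d is not None})
    gaps = [(later - earlier).days - 1
            for earlier, later in zip(dates, dates[1:])]
    return max(gaps, default=0)


ainsa_dates = ['2016-04-20', '2016-04-21', '2016-04-29', '20/04/2016']
print(longest_gap(ainsa_dates))  # 7
```

The badly formatted date `'20/04/2016'` is skipped rather than raising an error, which matches the advice to ignore records with incorrect dates.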


## A quick reminder of how this works

This document is formatted in [Markdown](https://daringfireball.net/projects/markdown/).

You can retrieve your data, which is always a string, by choosing a **`<KEY>`** (also a string). This ensures that you have different data from other people, so be creative. 

```
import requests

url = 'https://kata.scienxlab.org/challenge/sample-names'
params = {
    'key': <KEY>  # Replace <KEY> with your own string.
}
r = requests.get(url, params)
r.text
```

To answer question 1, change the `params`:

```
params = {
    'key': <KEY>,   # Use the same key you used to get your input.
    'question': 1,
    'answer': 1234  # Your answer; can be a float, int, list or array;
                    # the challenge description will tell you which.
}
```

To get a hint for a question, provide the question number but no answer:

```
params = {
    'question': 1,
}
```

[Complete instructions at kata.scienxlab.org](https://kata.scienxlab.org/challenge)

[An example notebook to get you started](https://gist.github.com/kwinkunks/50f11dac6ab7ff8c3e6c7b34536501a2)

----

© 2024 [Scienxlab](https://scienxlab.org/) &mdash; Code: openly licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) &mdash; Text: openly licensed under [CC BY](https://creativecommons.org/licenses/by/4.0/).